
Taking PISA Seriously: How Accurate are Low-Stakes Exams?

Journal of Labor Research

Abstract

PISA is seen as the gold standard for evaluating educational outcomes worldwide. Yet, because it is a low-stakes exam, students may not take it seriously, resulting in downward-biased scores and inaccurate rankings. This paper provides a method to identify and account for non-serious behavior in low-stakes exams by leveraging information in the computer-based assessments of PISA 2015. Our method corrects for non-serious behavior by fully imputing scores for items not taken seriously. We compare the scores and rankings calculated by our method to those calculated by giving zero points to skipped items, as well as to those calculated by treating skipped items at the end of the exam as if they were not administered, which is the procedure followed by PISA. We show that a country can improve its ranking by up to 15 places by encouraging its own students to take the exam seriously, and that the PISA approach corrects for only about half of the bias generated by non-seriousness.


Fig. 1

Note: Data Source: 2015 PISA Cognitive item dataset. Time spent on each question (by all students who face the question and spend some time on it, whether or not they answer it) is standardized to have mean zero and variance one. For each position in a cluster, the median standardized time of the questions in that position is calculated. The y-axis depicts the median standardized time spent on items at each position

Fig. 2

Note: Data Source: 2015 PISA Cognitive item dataset. The score on each question, 0, 0.5 or 1, is standardized so the overall score has mean zero and variance one. Items that are not reached or are missing are dropped from the sample; no-response items are assigned a score of 0. For each position in a cluster, the average standardized score of all questions in that position is calculated. The y-axis depicts the mean standardized score of items at each position
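For concreteness, the computations behind Figs. 1 and 2 can be sketched in a few lines of pandas; the data frame and column names below are illustrative stand-ins, not PISA's actual variable names.

```python
import pandas as pd

# One row per (student, item) event; toy values, illustrative column names.
df = pd.DataFrame({
    "position": [1, 2, 1, 2, 3, 3],                    # order within the cluster
    "time":     [45.0, 80.0, 50.0, 70.0, 30.0, 42.0],  # seconds spent on the item
    "score":    [1.0, 0.0, 0.5, 1.0, 0.0, 1.0],        # 0, 0.5 or 1
})

# Standardize each variable to mean zero and variance one overall.
for col in ["time", "score"]:
    df[col + "_std"] = (df[col] - df[col].mean()) / df[col].std()

# Fig. 1: median standardized time by position within a cluster.
print(df.groupby("position")["time_std"].median())

# Fig. 2: mean standardized score by position within a cluster.
print(df.groupby("position")["score_std"].mean())
```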

Fig. 3

Note: Data Source: 2015 PISA Cognitive item dataset. The residual time spent for each student and question is obtained by regressing time spent on each item on the type of question (multiple choice or open-ended), the position within a cluster, and the position of the cluster, and taking the residuals. Time spent is conditional on having answered the question. The y-axis depicts the mean residual time plotted against the difficulty of the items, measured by the fraction of students who answered the question correctly. The green line is for non-serious students and the red line is for serious students

Fig. 4

Note: Data Source: 2015 PISA Cognitive item dataset. The residual time spent for each student and question is obtained by regressing time spent on each item on the type of question (multiple choice or open-ended), the position within a cluster, and the position of the cluster, and taking the residuals. Time spent is conditional on having answered the question. The y-axis depicts the mean residual time plotted against the difficulty of the items, measured by the fraction of students who answered the question correctly. The red line is for serious students, while the black line is for non-serious students who satisfy criterion 3, i.e., missing-item students

Fig. 5

Note: Data Source: 2015 PISA Cognitive item dataset. The residual time spent for each student and question is obtained by regressing time spent on each item on the type of question (multiple choice or open-ended), the position within a cluster, and the position of the cluster, and taking the residuals. Time spent is conditional on having answered the question. The y-axis depicts the mean residual time plotted against the difficulty of the items, measured by the fraction of students who answered the question correctly. The red line is for serious students, while the black line is for non-serious students excluding missing-item students
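The residualization described in the notes to Figs. 3-5 amounts to one regression and a residual extraction; a minimal sketch, assuming illustrative column names rather than the actual PISA variable names:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy item-level data; `open_ended` is 1 for open-ended items, 0 for multiple choice.
df = pd.DataFrame({
    "time":        [45.0, 80.0, 50.0, 70.0, 30.0, 42.0, 55.0, 61.0],
    "open_ended":  [0, 1, 0, 1, 0, 0, 1, 0],
    "position":    [1, 2, 3, 4, 1, 2, 3, 4],   # position within the cluster
    "cluster_pos": [1, 1, 1, 1, 2, 2, 2, 2],   # position of the cluster in the exam
})

# Regress time spent on question type, within-cluster position and cluster
# position, then keep the residuals, as in the figure notes above.
fit = smf.ols("time ~ open_ended + C(position) + C(cluster_pos)", data=df).fit()
df["time_residual"] = fit.resid
```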

Fig. 6

Note: The top panel plots the scatter plot and regression line of ln(ans) against ln(yns), showing the contribution of non-serious students’ ability to the increased fraction correct. The middle panel plots the relationship between ln(ens) and ln(yns) for every country, showing the contribution of the extent of non-seriousness. The bottom panel plots the relationship between ln(pns) and ln(yns) for every country, showing the contribution of the proportion of non-serious students


Notes

  1. Other well-known low-stakes tests include the Trends in International Mathematics and Science Study (TIMSS) and the Progress in International Reading Literacy Study (PIRLS). PISA assesses whether students can apply what they have learned to solve “real world” problems. PIRLS and TIMSS are grade-based (4th and 8th graders) and curriculum oriented.

  2. Previous work in this field (Zamarro et al. 2019; Huang et al. 2012) has used the term “careless answering/responding” instead of “non-seriousness”.

  3. One item is one question; we use the words “item” and “question” interchangeably in the paper.

  4. See the article in The Guardian, May 6, 2014, entitled “OECD and Pisa tests are damaging education worldwide-academics”, Retrieved from the following link: https://www.theguardian.com/education/2014/may/06/oecd-pisa-tests-damaging-education-academics

  5. See the article in the National, Sept 25, 2017, entitled “Abu Dhabi pupils prepare for Pisa 2018”, Retrieved from the following link: https://www.thenational.ae/uae/abu-dhabi-pupils-prepare-for-pisa-2018-1.661627

  6. In the appendix, we provide some data suggesting that the results we see in the science section are likely to be correlated with, and magnified in, the reading and math sections, as non-seriousness seems more prevalent in math and reading than in science.

  7. One might argue that students do not understand that it is better to guess than to skip. However, if this were the only reason for skipping, then skipping behavior should not be related to the position of the item, which it clearly is, as shown below. One might also argue that since this is a computer-based test, students cannot go back to answer skipped items as they might on a paper test; if students do not realize this, they may skip inadvertently. But since students would quickly learn that they cannot go back, even if they did not know it to begin with, skipping should then be less prevalent in the second cluster than in the first. Again, the opposite is true.

  8. Note that the observable characteristics also include several proxies for non-cognitive skills, such as test anxiety and achievement motivation. See Table 16 for the full list of variables used in the imputation.

  9. SENA is short for Skipped at the End Not Administered, which is the procedure followed by PISA.

  10. One of their measures of effort is the extent to which performance falls when a question occurs later in the exam. Another is the extent to which questions are skipped in the survey that students have to fill out, and a third is the extent of carelessness in filling out the survey.

  11. Our estimates below also show that China seems to be less affected by non-serious behavior than the US.

  12. In the 2012 PISA exam, 32 countries/regions were invited to complete both a paper and a computer version of the mathematics test. By 2015, however, 58 countries had moved to a computer-based assessment. Jerrim (2016) and Jerrim et al. (2018) find that taking the PISA exam in a computer-based mode negatively affects students’ performance in many countries.

  13. For countries that choose to implement the assessment of financial literacy, an additional 60 minutes is required.

  14. For more detail, see Chapter 2 of the PISA 2015 Technical Report, OECD (2015b).

  15. One complex multiple choice question includes several yes-or-no questions.

  16. Some countries also administer parent and teacher questionnaires.

  17. Note that in the imputation we impute both multiple choice and open response questions. Our logic is that if a student did not answer a question because he did not know the answer, the imputation procedure is likely to assign that question a score of zero.

  18. There may have been technical issues that prevented them from taking the exam. In any case, there is no way for their responses to be imputed as there is no information.

  19. Roughly 60 minutes are allocated to the two science clusters, which together contain an average of 31 questions.

  20. This is similar to Kuhfeld and Soland (2019) in which a student is flagged as disengaged if over 10% of his or her item responses were rapid.

  21. Note that students who skipped open response questions in the middle of the exam, even when they spent very little time on them, were not classified as non-serious, though they could equally well have been. Such open questions, which are both unanswered and given too little time, account for only 0.7% of all questions, so we are not worried that this will affect our results.

  22. This is indeed an issue, as high-ability students (those with high scores) have a higher fraction correct on too-little-time items than on normal-time ones, while the opposite is true for low-ability students.

  23. To calculate time spent on two clusters, we add time spent in positions 1 and 2, or in positions 3 and 4.

  24. We did not plot time spent on the last 3 items for missing-item students because they miss these items by definition.

  25. Note that students satisfying criterion 3 have on average 15 more minutes left.

  26. To do so we regress time spent on each item on type of question (multiple choice or open ended), position within a cluster and position of the cluster. We then remove the effect of question type, position and cluster to get the residual for each student and question. We plot the residuals for correct and incorrect answers for serious and non-serious students. We do not include individual fixed effects in the regression as we wish to see how serious and non-serious students differ in their responses.

  27. Serious students spend 19.5 minutes per cluster while non-serious ones spend 17.8 minutes per cluster.

  28. We impute too-little-time items only for students who satisfy criterion 4.

  29. We include both multiple choice and open response items in criterion 3 (missing items) and criterion 4 (too-little-time items).

  30. This choice is unlikely to make a difference, as only 0.7% of students have less than 1 minute left, and 3% have less than 5 minutes left.

  31. If this were not so, there would be no similar individuals/items/schools to impute from.

  32. In the imputation, we categorize partial credit answers as wrong answers for simplicity. On average, only 8% of the questions on a student’s exam allow partial credit.

  33. There are 36 random numbers in total which determine the specific science clusters assigned to students. Moreover, students have science clusters either in the first two sessions or in the last two sessions. Therefore in total there are 72 groups within which students answer the same questions in the same order.

  34. See page 148 of OECD (2015b), chapter 9.

  35. This is also the practice used by Gneezy et al. (2019).

  36. To quote PISA (page 149 of OECD (2015a)):

    “Omitted responses prior to a valid response are treated as incorrect responses; whereas, omitted responses at the end of each of the two one-hour test sessions in both PBA and CBA are treated as not reached/not administered.”

  37. These numbers differ slightly from those in the original working paper, as we use sampling weights for each student in this version but did not in the earlier one. The ranks do not change across versions.

  38. Recall that non-serious items include non-reached, no-response and missing items, as well as too-little-time items when a student spends too little time on at least three items and the fraction correct on little-time items is lower than on normal-time ones. Here we also include open response items that are non-reached or no-response.

  39. These three regression lines add up to the 45° line.

  40. Imputed number correct is calculated by taking the mean of ten draws of number correct.

  41. Note that they do not account for skipped items in the middle of the exam or for too-little-time items.

  42. We used the 5 questions in ST118 for the anxiety variable and the 5 questions in ST119 for the ambition variable.

  43. We calculate the stakes of standardized tests given in school as follows. In the school questionnaire, school principals were asked whether the school used standardized tests for 11 different purposes. We assign each purpose a stake between 1 and 3 and sum the stakes for each school. We then sort countries by their mean stakes and mark the top 36 countries as high-stakes countries and the remaining 36 as low-stakes ones.

  44. Our results are robust to using fraction correct on items that are answered seriously as a measure of ability.

  45. See column 1 and the row for time on classes and out-of-school science learning.

  46. For each cluster, the value plotted at each level of question difficulty is the mean of the predicted probabilities at that level of difficulty.

References

  • Attali Y, Neeman Z, Schlosser A (2011) Rise to the challenge or not give a damn: Differential performance in high vs. low stakes tests

  • Azmat G, Calsamiglia C, Iriberri N (2016) Gender differences in response to big stakes. J Eur Econ Assoc 14(6):1372–1400

  • Azur M J, Stuart E A, Frangakis C, Leaf P J (2011) Multiple imputation by chained equations: What is it and how does it work? Int J Methods Psych Res 20(1):40–49

  • Baumert J, Demmrich A (2001) Test motivation in the assessment of student skills: The effects of incentives on motivation and performance. Eur J Psychol Educ 16(3):441

  • Borghans L, Schils T (2012) The leaning tower of PISA: Decomposing achievement test scores into cognitive and noncognitive components. Unpublished manuscript

  • Borgonovi F, Biecek P (2016) An international comparison of students’ ability to endure fatigue and maintain motivation during a low-stakes test. Learn Individ Differ 49:128–137

  • Butler J, Adams R J (2007) The impact of differential investment of student effort on the outcomes of international studies. J Appl Measur 8(3):279–304

  • Cole J S, Bergin D A, Whittaker T A (2008) Predicting student achievement for low stakes tests with effort and task value. Contemp Educ Psychol 33(4):609–624

  • Duckworth A L, Quinn P D, Lynam D R, Loeber R, Stouthamer-Loeber M (2011) Role of test motivation in intelligence testing. Proc Natl Acad Sci 108(19):7716–7720

  • Eklöf H (2010) Skill and will: Test-taking motivation and assessment quality. Assess Educ Principles Policy Practice 17(4):345–356

  • Eklöf H, Pavešič B J, Grønmo L S (2014) A cross-national comparison of reported effort and mathematics performance in TIMSS Advanced. Appl Meas Educ 27(1):31–45

  • Finn B (2015) Measuring motivation in low-stakes assessments. ETS Res Rep Ser 2015(2):1–17

  • Gneezy U, List J A, Livingston J A, Qin X, Sadoff S, Xu Y (2019) Measuring success in education: The role of effort on the test itself. Amer Econ Rev Insights 1(3):291–308

  • Hanushek E A, Wößmann L (2006) Does educational tracking affect performance and inequality? Differences-in-differences evidence across countries. Econ J 116(510):C63–C76

  • Hanushek E A, Link S, Woessmann L (2013) Does school autonomy make sense everywhere? Panel estimates from PISA. J Dev Econ 104:212–232

  • Huang J L, Curran P G, Keeney J, Poposki E M, DeShon R P (2012) Detecting and deterring insufficient effort responding to surveys. J Bus Psychol 27(1):99–114

  • Jacob B A (2005) Accountability, incentives and behavior: The impact of high-stakes testing in the Chicago public schools. J Publ Econ 89(5-6):761–796

  • Jalava N, Joensen J S, Pellas E (2015) Grades and rank: Impacts of non-financial incentives on test performance. J Econ Behav Organ 115:161–196

  • Jerrim J (2016) PISA 2012: How do results for the paper and computer tests compare? Assess Educ Principles Policy Practice 23(4):495–518

  • Jerrim J, Micklewright J, Heine J-H, Salzer C, McKeown C (2018) PISA 2015: How big is the ‘mode effect’ and what has been done about it? Oxf Rev Educ 44(4):476–493

  • Krosnick J A, Narayan S, Smith W R (1996) Satisficing in surveys: Initial evidence. New Dir Eval 1996(70):29–44

  • Kuhfeld M, Soland J (2019) Using assessment metadata to quantify the impact of test disengagement on estimates of educational effectiveness. J Res Educ Effect:1–29

  • Kuhfeld M, Soland J (2020) Using assessment metadata to quantify the impact of test disengagement on estimates of educational effectiveness. J Res Educ Effect 13(1):147–175

  • Lavy V (2015) Do differences in schools’ instruction time explain international achievement gaps? Evidence from developed and developing countries. Econ J 125(588):F397–F424

  • Leys C, Ley C, Klein O, Bernard P, Licata L (2013) Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. J Exp Soc Psychol 49(4):764–766

  • Lounkaew K (2013) Explaining urban–rural differences in educational achievement in Thailand: Evidence from PISA literacy data. Econ Educ Rev 37:213–225

  • OECD (2015a) PISA 2015 results (Volume I): Excellence and equity in education. Technical Report, OECD

  • OECD (2015b) PISA 2015 technical report. Technical Report, OECD

  • Penk C, Richter D (2017) Change in test-taking motivation and its relationship to test performance in low-stakes assessments. Educ Assess Eval Account 29(1):55–79

  • Pintrich P R, De Groot E V (1990) Motivational and self-regulated learning components of classroom academic performance. J Educ Psychol 82(1):33

  • Prince Edward Island (2002) Preparing students for PISA (mathematical literacy): Teacher’s handbook. Technical Report, Prince Edward Island

  • Schafer J L, Graham J W (2002) Missing data: Our view of the state of the art. Psychol Methods 7:147–177

  • Schnipke D L, Scrams D J (1997) Modeling item response times with a two-state mixture model: A new method of measuring speededness. J Educ Meas 34(3):213–232

  • Wise S L, DeMars C E (2005a) Low examinee effort in low-stakes assessment: Problems and potential solutions. Educ Assess 10(1):1–17

  • Wise S L, Kong X (2005b) Response time effort: A new measure of examinee motivation in computer-based tests. Appl Meas Educ 18(2):163–183

  • Wise S L (2006a) An investigation of the differential effort received by items on a low-stakes computer-based test. Appl Meas Educ 19(2):95–114

  • Wise S L, DeMars C E (2006b) An application of item response time: The effort-moderated IRT model. J Educ Meas 43(1):19–38

  • Wise S L, Ma L (2012) Setting response time thresholds for a CAT item pool: The normative threshold method. In: Annual meeting of the National Council on Measurement in Education, Vancouver, Canada

  • Wise S L, Soland J, Bo Y (2020) The (non)impact of differential test taker engagement on aggregated scores. Int J Test 20(1):57–77

  • Wolf L F, Smith J K (1995) The consequence of consequence: Motivation, anxiety, and test performance. Appl Meas Educ 8(3):227–242

  • Zamarro G, Hitt C, Mendez I (2019) When students don’t care: Reexamining international differences in achievement and student effort. J Hum Cap 13(4):000–000


Acknowledgments

We are grateful to participants at the Econometric Society World Congress in 2020, the Econometric Society meetings in Shanghai, China in 2018, the International Association of Applied Econometrics Conference in Cyprus in 2019, the Conference of the European Society for Population Economics (ESPE) in Bath, UK in 2019 and the 9th ifo Dresden Workshop on Labor Economics and Social Policy in 2019. We would particularly like to thank Joris Pinkse, Keisuke Hirano, and Kim Ruhl for their comments and suggestions and Meghna Bramhachari for help in proofreading. We owe special thanks to colleagues at the OECD for answering our numerous questions about the data. Huacong Liu was instrumental in our working on this project, and we thank her for all her help. We are responsible for all errors.


Corresponding author

Correspondence to Pelin Akyol.


Appendix

This appendix delves into more detail on a number of peripheral facts and issues. In the first part, we present some non-causal regressions on which students are non-serious and which questions tend to be taken non-seriously.

In the body of the paper we use the data on questions in the Science clusters only. Our reason for doing so is that all students take two Science clusters, but may not be tested in Math or Reading. One might ask whether the patterns in other parts of the exam are similar. As a check, we look at the fraction of non-serious items across subjects in the third part of the Appendix. Their similarity reassures us that our focus on the Science clusters is warranted.

In the fourth part of the Appendix, we discuss in more detail the behavior patterns of serious and non-serious students in terms of time spent and accuracy of response as a function of question position.

In the fifth part, we discuss the exact variables we use in the imputation procedure. In the sixth part, we explain some details behind the decomposition for partially serious students and present the results for them.

A.1 What Drives Being Non-serious?

We have seen in Section 3 that serious and non-serious students behave very differently. The next question is, what factors correlate with being non-serious? We explore this at two levels. First, we look at the correlates of an individual being non-serious; then we look at the correlates of a question not being taken seriously.

Table 7 Summary statistics

A.1.1 Summary Statistics

In this section we define the various factors that are potentially correlated with non-serious behavior. Table 7 gives descriptive statistics for these factors. Scores on the component parts of the exam (reading, math and science) are scaled so that the mean is 500 and the standard deviation is 100 across all OECD countries. Clearly, OECD countries do better than average, as the mean math and science scores overall are 464 and 474, respectively. Students are on average in the 10th grade, and half the students are female. The variable “anxiety” is an index we constructed from the questions on this subject (ranked from 1 to 4 by strength of agreement, where 1 is strongly disagree and 4 is strongly agree) by taking a simple average of the responses. The median is 2.8, suggesting a fair degree of anxiety on the part of students. Similarly for “ambition”, where the median response is 3.2.Footnote 42 The variable skipping class/arriving late adds up the responses to the three questions in ST062 about skipping class, its intensity, and arriving late. A 1 is never in the last two weeks, a 2 is 1 or 2 times, a 3 is 3 or 4 times, and a 4 is 5 or more times. On average, such behavior exists but is not endemic.

The median time spent learning out of school is 16 hours per week, while time spent learning in school is 27 hours per week. Thus students spend more than 40 hours a week on school-related work. The standard deviations are roughly 15 and 11, suggesting that a fair number of students spend well over 60 to 70 hours a week on such work. Standardized test frequency and teacher-developed test frequency are the responses to question SC034. A response of 1 means there were no such tests and a response of 5 means tests were given more than monthly. The median value is 2, i.e., tests were given 1-2 times a year. The variable “stakes of standardized (teacher-developed) tests” comes from the answers to SC035. The question is composed of 11 yes/no sub-questions (a yes is coded as 1 and a no as 0) regarding the purpose of these tests. We label each purpose as low, medium or high stakes for the students, giving it a weight of 1, 2 or 3, respectively. Of the 11 sub-questions, 5 are low, 3 are medium and 3 are high stakes. We then add up these weighted responses to get our index. As the maximum value the index can take is 20, the medians of 10 and 13 suggest the stakes are high, especially for teacher-developed tests.
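As an illustration of the stakes index just described, here is a minimal sketch; the yes/no answers and the assignment of the 11 purposes to low/medium/high stakes are made up, and only the 1/2/3 weighting and the summation mirror the text.

```python
import pandas as pd

# Hypothetical yes/no answers (1/0) to the 11 sub-questions of SC035 for three schools.
answers = pd.DataFrame(
    [[1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0],
     [0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1],
     [1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]],
    columns=[f"purpose_{k}" for k in range(1, 12)],
)

# 5 low-stakes purposes (weight 1), 3 medium (weight 2), 3 high (weight 3);
# the maximum attainable index is therefore 5*1 + 3*2 + 3*3 = 20.
weights = pd.Series([1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3], index=answers.columns)

stakes_index = answers.mul(weights, axis=1).sum(axis=1)
print(stakes_index)  # one index value per school
```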

A.1.2 Who is Non-serious?

The factors that correlate with a student being non-serious are explored in Table 8. Column 1 shows the results for all countries. The dependent variable is 1 if the student is non-serious. In columns 1 to 3, being non-serious is defined as meeting at least one of criteria 1, 2 or 4; in column 4, it is defined as meeting criterion 3. We make this distinction because the patterns explored in the previous section differ across these two groups. We also look separately at high-stakes countries, ones where the standardized tests given in school are high stakes,Footnote 43 and at low-stakes countries, as the patterns in the two might differ. If, for example, students are fed up with exams in high-stakes countries but not in low-stakes countries, we might expect a higher probability of being non-serious in PISA in high-stakes countries. One might want to run these regressions country by country, but with 58 countries this would be overkill, as it is not the main object of this paper. Also note that we are not claiming any causal effects, merely pointing out some correlations in the data. A sketch of this type of regression appears below.
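The regressions in Table 8 take roughly the following form; the sketch below is a toy linear probability model with placeholder regressors and made-up data, not our exact specification.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy student-level data; the variable names stand in for the regressors
# discussed in the text (math score, ESCS, grade, gender, and so on).
students = pd.DataFrame({
    "non_serious": [1, 0, 0, 1, 0, 1, 0, 0],
    "math_score":  [420.0, 510.0, 560.0, 450.0, 600.0, 430.0, 530.0, 490.0],
    "escs":        [0.3, -0.5, 1.1, 0.8, 0.2, -1.0, 0.0, 0.6],
    "female":      [0, 1, 1, 0, 1, 0, 0, 1],
    "grade":       [10, 10, 11, 9, 10, 9, 11, 10],
})

# Linear probability model of being non-serious on student characteristics,
# with heteroskedasticity-robust standard errors.
lpm = smf.ols("non_serious ~ math_score + escs + female + grade",
              data=students).fit(cov_type="HC1")
print(lpm.summary())
```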

Table 8 Factors related to being non-serious

To begin with, we ask whether better students are more or less likely to be non-serious. Columns 1-3 suggest that higher math scores (a proxy for ability) are associated with being less likely to be non-serious, except when we use criterion 3, suggesting that students with missing items are a different breed.Footnote 44 Students with high socioeconomic status (ESCS) and in lower grades are more likely to be non-serious; again, the signs in column 4 are reversed. This suggests that poor but able students are non-serious in the sense of criterion 3, while the rest are non-serious in the sense of criteria 1, 2 or 4.

Students from richer countries are more likely to be non-serious, though the relationship is an inverted U with a turning point at about $33,000 in per capita GDP. This pattern is again reversed in column 4, where the relationship is U-shaped with a turning point at about $38,500.

Gender matters: women are less likely to be non-serious in columns 1-3, but more likely to be non-serious (by quitting in the middle of the exam) in column 4, suggesting that women “blow off” the exam in different ways than men. As might be expected, being anxious or ambitious is associated with being less likely to be non-serious, while being undisciplined, i.e., having a pattern of skipping class or arriving late, is associated with being non-serious.

One might speculate that students who are over-worked and over-tested, especially with high-stakes exams, suffer test fatigue and passively resist taking yet another test, and are therefore more likely not to take PISA seriously. There is some evidence in favor of this. First, countries with high-stakes exams do seem to make students work harder: on average, students spend 1.3 hours more per week in class and 3.1 hours more on out-of-school learning in all subjects in high-stakes countries relative to low-stakes ones. Working harder seems to be associated with not taking PISA seriously. In column 1, spending more time on studies out of school is significant for all countries together, but the effect seems to come from high-stakes countries. Time spent in school is positive but not significant for all countries together, but is significant for low-stakes countries.Footnote 45 Having more tests (standardized or teacher-developed) does correlate positively with being non-serious overall, though the coefficients are not significant. This might be because the effects differ in high-stakes and low-stakes countries: having more standardized tests raises the likelihood of being non-serious in high-stakes countries (column 2) but does the opposite in low-stakes ones (column 3).

When teacher-developed tests are given, raising the stakes seems to make students more likely to be serious, not less, suggesting that such testing may be less likely to result in test fatigue. Students from better schools, as reflected in the log of the school science score, are also less likely to be non-serious in low-stakes countries, but more likely to be non-serious in high-stakes countries. This makes sense if better schools push students harder in high-stakes countries, resulting in fatigue. In Tables 9 and 10, we present correlates of each non-seriousness criterion for cutoff levels of 10% (as defined in Section 3) and 6%, respectively. The results are consistent across cutoff levels.

Our results here should be seen as preliminary as there is no causation implied, merely correlation. The patterns described above are suggestive and might be worth exploring in future work.

Table 9 Factors related to being non-serious for each criterion (for cutoff level of 10%)
Table 10 Factors related to being non-serious for each criterion (for cutoff level of 6%)

A.1.3 Which Questions are Not Taken Seriously?

We define non-serious questions as those that were not reached, received no response, were missing, or on which too little time was spent. Non-reached and no-response items were looked at by the student, who then chose not to answer despite having time left. Had he taken the exam seriously, he would have answered to the best of his ability. The student did not even look at missing items, despite having time left; in general, students have ample time to finish the exam, and not even bothering to read a question is again an indication of non-seriousness. One might argue that no-response items, i.e., those skipped in the middle of the exam, should be treated differently because this was a computer-based exam and students could not go back. Assuming they knew this, choosing to skip again indicates the question was not taken seriously. Questions on which too little time was spent (as explained in criterion 4 for defining non-serious students) are those where the response time is below a country-specific threshold and where the proportion correct is lower than that on normal-time items for the same person. The latter condition prevents us from mistakenly labeling a question as non-serious when the student in fact knew the answer immediately and therefore spent little time on it.
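Our reading of the too-little-time rule can be summarized in a short function; the column names, toy data and exact details below are assumptions for illustration, with only the three-item minimum and the fraction-correct comparison taken from the text.

```python
import pandas as pd

def too_little_time_items(items: pd.DataFrame, threshold: float) -> pd.Series:
    """Flag a student's too-little-time items (a sketch of criterion 4).

    `items` holds one student's answered items with columns `time` (seconds)
    and `correct` (0/1); `threshold` is the country-specific response-time
    cutoff. Fast items count as non-serious only if there are at least three
    of them and their fraction correct is below that of normal-time items.
    """
    fast = items["time"] < threshold
    if fast.sum() < 3:
        return pd.Series(False, index=items.index)
    if items.loc[fast, "correct"].mean() < items.loc[~fast, "correct"].mean():
        return fast
    return pd.Series(False, index=items.index)

# Example: three rushed items answered worse than the rest are flagged.
student = pd.DataFrame({"time":    [5.0, 4.0, 6.0, 60.0, 75.0, 50.0],
                        "correct": [0,   0,   1,   1,    1,    0]})
print(too_little_time_items(student, threshold=10.0))
```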

We explore the effects of question characteristics on the probability of a question being skipped, i.e., being non-reached or no-response, and likewise on the probability of too little time being spent on a question. In both cases we run a linear probability model with individual fixed effects as well as question characteristics (see the sketch below). Figure 7 shows the predicted probability of skipping a question and of spending too little time on a question for each cluster as a function of the difficulty of the question.Footnote 46 In all clusters, as the difficulty of the question increases, the probability of skipping increases, though there is a slight decrease as questions become very difficult (top panel). In the bottom panel, the probability of spending too little time is roughly flat: first increasing, then decreasing, and finally increasing again. Students seem to try to answer when the question is easy, but as it gets difficult, they seem to give up. There are also differences between clusters: consistent with the “fatigue” hypothesis, questions are more likely to be taken non-seriously in the second and fourth clusters.
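The fixed-effects linear probability model can be estimated via the within transformation; a minimal sketch with toy data and a single illustrative regressor:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy item-level data; `skip` is 1 if the question was skipped.
df = pd.DataFrame({
    "student":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "skip":       [0, 0, 1, 0, 1, 1, 0, 0, 0],
    "difficulty": [0.2, 0.5, 0.8, 0.3, 0.6, 0.9, 0.1, 0.4, 0.7],
})

# Individual fixed effects via the within transformation: demean the outcome
# and the regressors by student, then run OLS (no intercept) on the result.
cols = ["skip", "difficulty"]
demeaned = df[cols] - df.groupby("student")[cols].transform("mean")
fe_lpm = smf.ols("skip ~ difficulty - 1", data=demeaned).fit()
print(fe_lpm.params)
```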

Fig. 7

Pr(skip) and Pr(spend too little time) w.r.t. cluster and difficulty. Predicted probabilities are obtained from a linear probability model with individual fixed effects as well as question characteristics such as cluster, sequence, difficulty and the type of the question. For each cluster, the predicted probability at each level of question difficulty takes the mean value of the predicted probability at that level of difficulty. In the figure, lowess-smoothed lines are presented

Table 11 Factors affecting Pr(Skip) and Pr(Spend too little time) (Individual Characteristics)

In Fig. 8, we explore whether question type affects the probability of skipping or of spending too little time as a function of question order. For all questions, the probability of skipping rises with order, or sequence, within a cluster, jumps down at the beginning of a new cluster, and more so after the break, which is consistent with “fatigue”. The curve for complex multiple choice questions lies between those for open response and simple multiple choice questions. This makes sense, as it is easy to guess an answer to a simple multiple choice question, so such questions are less likely to be skipped.

Fig. 8

Pr(skip) and Pr(spend too little time) w.r.t. sequence and the type of the question. Predicted probabilities are obtained from a linear probability model with individual fixed effects as well as question characteristics such as cluster, sequence, difficulty, and the type of the question. For each question type, the mean value of the predicted probability at each order is presented

Non-serious behavior in terms of spending too little time falls weakly with order within a cluster for all question types. However, there is a large jump up at the beginning of the second and fourth clusters. This pattern suggests that, for open response questions at least, students substitute towards skipping as the exam proceeds, with a reset at each new cluster. Hence we see a fall with sequence within a cluster and a jump up in each new cluster. While skipping is more likely for open response questions, spending too little time is less likely for them relative to other question types.

To understand the effects of individual characteristics on the probability of skipping or of spending too little time, we regress the estimated individual fixed effects from our linear probability model on individual characteristics; see Table 11. The results are in line with those in Table 8.

So far we have run the choice regressions as if the choices were independent. However, the appropriate model is a multinomial choice model, as the student has three mutually exclusive and exhaustive options for each question: skip, answer with too little time, or answer with normal time. We used the linear probability model because it allowed us to incorporate individual fixed effects, which we could not do with a logit. With a logit, we can control for individual characteristics, but as we are unlikely to have information on all relevant characteristics, we might have omitted variable bias.
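A sketch of such a three-alternative model, using statsmodels' MNLogit on synthetic data (the regressor names are placeholders; the baseline alternative is answering with normal time):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Outcome coding: 0 = answer with normal time (baseline), 1 = skip,
# 2 = answer with too little time. All values below are synthetic.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "difficulty": rng.uniform(0, 1, n),
    "position":   rng.integers(1, 18, n),
    "open_ended": rng.integers(0, 2, n),
})
choice = rng.integers(0, 3, n)

mnl = sm.MNLogit(choice, sm.add_constant(X)).fit(disp=False)
print(mnl.params)  # one coefficient column per non-baseline alternative
```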

Table 12 presents the results of a logit regression where the baseline choice is spending normal time answering the item. In the regression, we control for the question characteristics and the individual characteristics used in the previous tables. The first and second columns show the factors affecting the probability of skipping and the probability of spending too little time, respectively. The position within a cluster is positively correlated with the probability of skipping and negatively correlated with the probability of spending too little time, consistent with students switching from spending too little time to skipping as the exam progresses. A question in the second, third or fourth cluster is more likely to be skipped than one in the first cluster, and this likelihood is much higher in the second and fourth clusters, as they are the last clusters in each science session. Open response and complex multiple choice questions move students towards skipping and away from spending too little time. As the difficulty of a question increases, students become more likely both to skip it and to spend too little time on it. The coefficients on individual characteristics are roughly in line with those in Table 8. The math score of the student is negatively correlated with both the probability of skipping and the probability of spending too little time. Female students are less likely to skip or to spend too little time, and ambitious students are less likely to skip. Consistent with our previous findings, students from richer countries are more likely to skip and to spend too little time, though the relationship is an inverted U with a turning point at about $43,000 in per capita GDP. We control for standardized and teacher-developed test frequencies to investigate whether students are fed up with testing and as a result do not take PISA seriously. We find that as the frequency of standardized tests increases, students’ likelihood of skipping and of spending too little time increases significantly, which is consistent with the “fatigue” effect. Teacher-developed tests do the exact opposite, suggesting that students view the two kinds of tests very differently.

Table 12 Factors affecting Pr(Skip) and Pr(Spend too little time) (Logit results)

A.2 Fraction of Non-serious items Across Subjects

Table 13 shows the fraction of no-response items and the fraction of non-reached items for science, reading and math. The fractions of no-response items for the reading and math tests are a bit higher on average than for science. Moreover, the fractions of no-response and non-reached items are highly correlated across subjects. For example, the correlation between the fraction of no-response items in science and in reading is 0.98, showing that non-seriousness is common across the subjects of the test, as might be expected.

Table 13 Fraction of non-serious items across subjects
Table 14 Time per science cluster (minutes)

A.3 Time Spent, Accuracy and Position

Table 14 shows time per science cluster across positions for serious and non-serious students. Time spent on a cluster falls with the position of the cluster and then jumps back up after the break at the end of cluster 2, more so for non-serious students. There is substantial heterogeneity among non-serious students according to the criterion used. Students with no-response or too-little-time items, not surprisingly, spend less time per cluster than serious students regardless of cluster position. The opposite holds for those with non-reached or missing items, but only for the first and third clusters; for the second and fourth clusters, their time spent is 30-40% less than that of serious students. It is also worth noting that for these students time is still not a constraint: on average they have more than 15 minutes left. This suggests that “fatigue” sets in faster for non-serious students.

The upper part of Table 15 shows the proportion correct on all items (not just answered ones) across positions. Serious students have a higher proportion correct than each category of non-serious students. Accuracy falls in the second cluster compared to the first, more so for non-serious students, reminiscent of the patterns for time spent. However, non-serious students will have a lower proportion correct on all items by definition, as they skip many items. To measure their accuracy, we should instead divide by the number of answered questions, as done in the lower part of Table 15. The numbers show that even with this correction non-serious students have lower accuracy than serious ones. In addition, the fall in accuracy across clusters is now similar (around 2%) for both serious and non-serious students. This is consistent with non-serious students’ performance dropping substantially in the second cluster primarily because they skip more items there.

Table 15 Proportion correct in science clusters

A.4 Variables Used in Imputation

The PISA data contain a rich array of information from the student and school questionnaires in the survey. In the imputation we use variables constructed from these surveys by PISA, choosing the variables that seem relevant. A list of the variables used is given in Table 16. Binary variables are clearly identified; all others are continuous indices. Details are available in the PISA technical report, OECD (2015b), Chapter 16. The imputation also uses the individual’s scores on all other items and other students’ scores on all items, as in standard MICE imputations. We also include country fixed effects in the imputations.
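For readers who want to replicate the spirit of the procedure, a chained-equations imputation can be run in a few lines; the sketch below uses scikit-learn's IterativeImputer on toy data and omits the refinements described in the paper (categorical item scores, ten draws, country fixed effects).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Rows are students; columns are item scores plus a background index.
# np.nan marks the non-serious items to be imputed. Toy values only.
scores = np.array([
    [1.0, 0.0,    1.0,    0.8],
    [0.0, np.nan, 1.0,   -0.3],
    [1.0, 1.0,    np.nan, 1.1],
    [0.0, 0.0,    0.0,   -0.9],
])

# Chained-equations imputation in the spirit of MICE (Azur et al. 2011).
imputer = IterativeImputer(sample_posterior=True, random_state=0)
print(imputer.fit_transform(scores).round(2))
```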

Table 16 Variables used in imputation

A.5 Decomposition for Partially Serious Students

We call fully serious those students who neither skip items nor spend too little time on any item. These fully serious students, together with what we call partially serious students, make up what we have termed serious students. For fully serious students, the number correct is the same before and after imputation by definition. The increase in fraction correct for serious students (Ys) therefore comes only from imputations for partially serious students, who skipped a few items or spent too little time on a small enough number of items that they were not classified as non-serious. Let PS denote the set of partially serious students. Next we decompose Ys into its component parts.

$$ \begin{aligned} Y_{s} &= \frac{\sum_{i\in S} I_{i}}{\sum_{i\in S\cup NS} T_{i}} \\ &= \frac{\sum_{i\in PS} I_{i}}{\sum_{i\in PS} NI_{i}} \cdot \frac{\sum_{i\in PS} NI_{i}}{\sum_{i\in PS} T_{i}} \cdot \frac{\sum_{i\in PS} T_{i}}{\sum_{i\in S\cup NS} T_{i}} \\ &= A_{ps} E_{ps} P_{ps} \end{aligned} $$

Aps is the increase in the fraction correct for non-serious items among partially serious students. Eps is the fraction of non-serious items among all items for partially serious students, which measures the degree of non-seriousness. Pps approximately measures the proportion of partially serious students in a country as partially serious students on average have the same number of total items as other students. The values of Yps, Aps, Eps and Pps for each country are provided in Table 17.

Table 17 Decomposed factors for partially serious students

Similar to the decomposition for non-serious students, we divide each factor by its geometric mean across countries and get

$$ y_{ps}=\frac{Y_{ps}}{\bar{Y}_{ps}}=\left( \frac{A_{ps}}{\bar{A}_{ps}}\right) \left( \frac{E_{ps}}{\bar{E}_{ps}}\right) \left( \frac{P_{ps}}{\bar{P}_{ps}} \right) =a_{ps}e_{ps}p_{ps} $$
(7)

Taking the logarithm of both sides of Eq. 7 gives:

$$ \ln (y_{ps})=\ln a_{ps}+\ln e_{ps}+\ln p_{ps} $$
(8)

Next we regress \(\ln a_{ps}\), \(\ln e_{ps}\) and \(\ln p_{ps}\) separately on \(\ln y_{ps}\), that is,

$$ \begin{aligned} \ln a_{ps} &= \alpha_{2}\ln y_{ps}+\epsilon_{a} \\ \ln e_{ps} &= \beta_{2}\ln y_{ps}+\epsilon_{e} \\ \ln p_{ps} &= \gamma_{2}\ln y_{ps}+\epsilon_{p}. \end{aligned} $$

Let the OLS estimates be denoted \(\hat{\alpha}_{2}\), \(\hat{\beta}_{2}\), \(\hat{\gamma}_{2}\). As for non-serious students, one can show that \(\hat{\alpha}_{2}+\hat{\beta}_{2}+\hat{\gamma}_{2}=1\), and these coefficients measure the contributions of partially serious students’ ability, extent of non-seriousness and proportion to a country’s increase in fraction correct. Figure 9 plots the scatter plots and regression lines for partially serious students.
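The adding-up property is easy to verify numerically: since \(\ln y_{ps}\) is the sum of its three components, the no-intercept OLS slopes must sum to exactly one. A short sketch with simulated country-level data:

```python
import numpy as np

# Simulate ln(a), ln(e), ln(p) for 50 countries and build ln(y) as their sum.
rng = np.random.default_rng(1)
ln_a, ln_e, ln_p = rng.normal(size=(3, 50))
ln_y = ln_a + ln_e + ln_p

def slope(x: np.ndarray, y: np.ndarray) -> float:
    """OLS slope of a regression of x on y without an intercept."""
    return float(x @ y / (y @ y))

alpha2, beta2, gamma2 = (slope(v, ln_y) for v in (ln_a, ln_e, ln_p))
# OLS is linear in the regressand, so the three slopes sum to one exactly
# (up to floating-point error).
print(alpha2 + beta2 + gamma2)  # ~1.0
```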

Fig. 9

yps Versus its Components for Partially Serious Students


Cite this article

Akyol, P., Krishna, K. & Wang, J. Taking PISA Seriously: How Accurate are Low-Stakes Exams? J Labor Res 42, 184–243 (2021). https://doi.org/10.1007/s12122-021-09317-8
