-
Issue Information Journal of Educational Measurement (IF 1.056) Pub Date : 2021-03-16
Editor SANDIP SINHARAY, Educational Testing Service
-
Robust Estimation for Response Time Modeling Journal of Educational Measurement (IF 1.056) Pub Date : 2020-11-04 Maxwell Hong, Daniella A. Rebouças, Ying Cheng
Response time has come to play an increasingly important role in educational and psychological testing, prompting the proposal of many response time models in recent years. However, response time modeling can be adversely affected by aberrant response behavior. For example, test speededness can cause the response times on certain items to deviate from the hypothesized model. In this article, we introduce
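For context, the workhorse model in this literature is van der Linden's lognormal response time model; the statement below is a sketch of that general model, not the robust estimator introduced in the article.

```latex
% Lognormal response time model (van der Linden):
% T_{ij} is the response time of person j on item i, \tau_j the person's speed,
% \beta_i the item's time intensity, and \alpha_i its time-discrimination parameter.
\log T_{ij} \sim \mathcal{N}\!\left(\beta_i - \tau_j,\; \alpha_i^{-2}\right)
```

Under this model, speededness shows up as response times that are systematically shorter than the model predicts for end-of-test items; robust estimation downweights such aberrant observations rather than letting them distort the parameter estimates.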
-
Simultaneous Constrained Adaptive Item Selection for Group‐Based Testing Journal of Educational Measurement (IF 1.056) Pub Date : 2020-10-18 Daniel Bengs, Ulf Kroehne, Ulf Brefeld
By tailoring test forms to the test‐taker's proficiency, Computerized Adaptive Testing (CAT) enables substantial increases in testing efficiency over fixed forms testing. When used for formative assessment, the alignment of task difficulty with proficiency increases the chance that teachers can derive useful feedback from assessment data. The application of CAT to formative assessment in the classroom
-
Issue Information Journal of Educational Measurement (IF 1.056) Pub Date : 2020-12-08
Editor SANDIP SINHARAY, Educational Testing Service
-
Robust Estimation of Ability and Mental Speed Employing the Hierarchical Model for Responses and Response Times Journal of Educational Measurement (IF 1.056) Pub Date : 2020-10-13 Jochen Ranger, Jörg‐Tobias Kuhn, Anett Wolgast
Van der Linden's hierarchical model for responses and response times can be used to infer the ability and mental speed of test takers from their responses and response times in an educational test. A standard approach for this is maximum likelihood estimation. In real‐world applications, the data of some test takers might be partly irregular, resulting from rapid guessing or item preknowledge
-
Statistical Theoreticians and Educational Assessment: Comments on Shelby Haberman's NCME Career Contributions Award Journal of Educational Measurement (IF 1.056) Pub Date : 2020-08-24 Robert J. Mislevy
In his 2019 NCME Career Contributions Award address, Dr. Shelby Haberman uses examples of three kinds to illustrate how his training in theoretical statistics influenced his contributions to educational measurement. I bracket my comments on his address, and his contributions more generally, by considering two questions: Why might any theoretical statisticians receive this award? Why aren't all recipients
-
Statistical Theory and Assessment Practice Journal of Educational Measurement (IF 1.056) Pub Date : 2020-08-17 Shelby J. Haberman
Examples of the impact of statistical theory on assessment practice are provided from the perspective of a statistician trained in theoretical statistics who began to work on assessments. Goodness of fit of item‐response models is examined in terms of restricted likelihood‐ratio tests and generalized residuals. Minimum discriminant information adjustment is used for linking with no anchors or problematic
-
Using a Projection IRT Method for Vertical Scaling When Construct Shift Is Present Journal of Educational Measurement (IF 1.056) Pub Date : 2020-08-13 Tyler Strachan, Uk Hyun Cho, Kyung Yong Kim, John T. Willse, Shyh‐Huei Chen, Edward H. Ip, Terry A. Ackerman, Jonathan P. Weeks
In vertical scaling, results of tests from several different grade levels are placed on a common scale. Most vertical scaling methodologies rely heavily on the assumption that the construct being measured is unidimensional. In many testing situations, however, such an assumption could be problematic. For instance, the construct measured at one grade level may differ from that measured in another grade
-
Standard Errors of Variance Components, Measurement Errors and Generalizability Coefficients for Crossed Designs Journal of Educational Measurement (IF 1.056) Pub Date : 2020-08-06 Rashid S. Almehrizi
Estimates of various variance components, universe score variance, measurement error variances, and generalizability coefficients, like all statistics, are subject to sampling variability, particularly in small samples. Such variability is quantified traditionally through estimated standard errors and/or confidence intervals. The paper derives new standard errors for all estimated statistics for two
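As a reminder of the quantities involved, the relative and absolute generalizability coefficients for a single-facet crossed persons-by-items (p × i) design with n_i items take the familiar forms below; this is a sketch of the standard definitions, not the paper's new standard-error derivations.

```latex
E\rho^{2} \;=\; \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \sigma^{2}_{pi,e}/n_{i}},
\qquad
\Phi \;=\; \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \left(\sigma^{2}_{i} + \sigma^{2}_{pi,e}\right)/n_{i}}
```

Because these coefficients are nonlinear functions of estimated variance components, their sampling variability in small samples is exactly what the derived standard errors are meant to quantify.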
-
Using Retest Data to Evaluate and Improve Effort‐Moderated Scoring Journal of Educational Measurement (IF 1.056) Pub Date : 2020-07-01 Steven L. Wise, Megan R. Kuhfeld
There has been a growing research interest in the identification and management of disengaged test taking, which poses a validity threat that is particularly prevalent with low‐stakes tests. This study investigated effort‐moderated (E‐M) scoring, in which item responses classified as rapid guesses are identified and excluded from scoring. Using achievement test data composed of test takers who were
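The core of effort-moderated scoring is simple to sketch: responses flagged as rapid guesses (here, via a purely illustrative fixed time threshold) are treated as not administered rather than as incorrect. The function below is a minimal Python sketch of that idea, not the operational implementation studied in the article.

```python
import numpy as np

def effort_moderated_score(correct, resp_time, thresholds):
    """Effort-moderated (E-M) scoring sketch: item responses faster than an
    item-specific rapid-guessing threshold are excluded from scoring.
    All three arguments are equal-length arrays for a single examinee;
    the thresholds here are hypothetical."""
    correct = np.asarray(correct, dtype=float)
    rapid = np.asarray(resp_time) < np.asarray(thresholds)
    effortful = ~rapid
    if effortful.sum() == 0:
        return np.nan, 0.0                      # no solution behavior observed
    em_score = correct[effortful].mean()        # proportion correct on effortful responses
    rte = effortful.mean()                      # response-time effort (share of effortful responses)
    return em_score, rte

# toy example: two rapid guesses out of six items with a 3-second threshold
score, rte = effort_moderated_score(
    correct=[1, 0, 1, 1, 0, 1],
    resp_time=[12.4, 1.1, 25.0, 9.8, 0.9, 14.2],
    thresholds=[3.0] * 6,
)
```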
-
A Recursion‐Based Analytical Approach to Evaluate the Performance of MST Journal of Educational Measurement (IF 1.056) Pub Date : 2020-06-30 Hwanggyu Lim, Tim Davey, Craig S. Wells
This study proposed a recursion‐based analytical approach to assess measurement precision of ability estimation and classification accuracy in multistage adaptive tests (MSTs). A simulation study was conducted to compare the proposed recursion‐based analytical method with an analytical method proposed by Park, Kim, Chung, and Dodd and with the more commonly used Monte Carlo (MC) simulation approaches
-
A Response Time Process Model for Not‐Reached and Omitted Items Journal of Educational Measurement (IF 1.056) Pub Date : 2020-05-04 Jing Lu, Chun Wang
Item nonresponses are prevalent in standardized testing. They happen either when students fail to reach the end of a test due to a time limit or quitting, or when students choose to omit some items strategically. Oftentimes, item nonresponses are nonrandom, and hence, the missing data mechanism needs to be properly modeled. In this paper, we propose using an innovative item response time model as
-
A Novel Partial Credit Extension Using Varying Thresholds to Account for Response Tendencies Journal of Educational Measurement (IF 1.056) Pub Date : 2020-04-01 Mirka Henninger
Item Response Theory models with varying thresholds are essential tools to account for unknown types of response tendencies in rating data. However, in order to separate constructs to be measured and response tendencies, specific constraints have to be imposed on varying thresholds and their interrelations. In this article, a multidimensional extension of a Partial Credit Model using a sum‐to‐zero
-
Using Natural Language Processing to Predict Item Response Times and Improve Test Construction Journal of Educational Measurement (IF 1.056) Pub Date : 2020-02-24 Peter Baldwin, Victoria Yaneva, Janet Mee, Brian E. Clauser, Le An Ha
In this article, it is shown how item text can be represented by (a) 113 features quantifying the text's linguistic characteristics, (b) 16 measures of the extent to which an information‐retrieval‐based automatic question‐answering system finds an item challenging, and (c) dense word representations (word embeddings). Using a random forests algorithm, these data are then used to train a prediction
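A minimal sketch of the modeling step follows: a random forest regressor is trained on an item-by-feature matrix to predict mean item response times. The data below are synthetic and the feature counts are placeholders; the article's actual feature extraction pipeline is not reproduced.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_items = 500
# Hypothetical design matrix: linguistic features, question-answering
# difficulty measures, and averaged word-embedding dimensions per item.
X = rng.normal(size=(n_items, 129 + 300))
y = rng.lognormal(mean=4.0, sigma=0.3, size=n_items)    # mean response time in seconds

model = RandomForestRegressor(n_estimators=300, random_state=0)
cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2")   # out-of-sample check
model.fit(X, y)
predicted_seconds = model.predict(X[:10])
```

The predicted times could then be used when assembling forms against time limits, which is the test-construction use suggested by the title.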
-
A Latent Class Signal Detection Model for Rater Scoring with Ordered Perceptual Distributions Journal of Educational Measurement (IF 1.056) Pub Date : 2020-02-18 Lawrence T. DeCarlo, Xiaoliang Zhou
In signal detection rater models for constructed response (CR) scoring, it is assumed that raters discriminate equally well between different latent classes defined by the scoring rubric. An extended model that relaxes this assumption is introduced; the model recognizes that a rater may not discriminate equally well between some of the scoring classes. The extension recognizes a different type of rater
-
A Framework for Measuring the Amount of Adaptation of Rasch‐based Computerized Adaptive Tests Journal of Educational Measurement (IF 1.056) Pub Date : 2020-02-18 Adam E. Wyse, James R. McBride
A key consideration when giving any computerized adaptive test (CAT) is how much adaptation is present when the test is used in practice. This study introduces a new framework to measure the amount of adaptation of Rasch‐based CATs based on looking at the differences between the selected item locations (Rasch item difficulty parameters) of the administered items and target item locations determined
-
Studying Score Stability with a Harmonic Regression Family: A Comparison of Three Approaches to Adjustment of Examinee‐Specific Demographic Data Journal of Educational Measurement (IF 1.056) Pub Date : 2020-02-18 Yi‐Hsuan Lee, Shelby J. Haberman
For assessments that use different forms in different administrations, equating methods are applied to ensure comparability of scores over time. Ideally, a score scale is well maintained throughout the life of a testing program. In reality, instability of a score scale can result from a variety of causes, some of which are expected while others may be unforeseen. The situation is more challenging for assessments
-
Sensitivity of the RMSD for Detecting Item‐Level Misfit in Low‐Performing Countries Journal of Educational Measurement (IF 1.056) Pub Date : 2019-12-25 Jesper Tijmstra, Maria Bolsinova, Yuan‐Ling Liaw, Leslie Rutkowski, David Rutkowski
Although the root mean squared deviation (RMSD) is a popular statistical measure for evaluating country‐specific item‐level misfit (i.e., differential item functioning [DIF]) in international large‐scale assessment, this paper shows that its sensitivity to detect misfit may depend strongly on the proficiency distribution of the considered countries. Specifically, items for which most respondents in
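The sensitivity issue can be seen directly in how the RMSD is computed: squared deviations between observed and model-implied item characteristic curves are weighted by the group's proficiency density, so misfit located where a country has few respondents barely registers. The following is a minimal numerical sketch of that weighted statistic with synthetic curves, not the operational large-scale-assessment implementation.

```python
import numpy as np

def rmsd_item_fit(p_observed, p_model, density):
    """RMSD between (pseudo-)observed and model-implied item response curves
    on a quadrature grid, weighted by the group's proficiency density."""
    density = np.asarray(density, dtype=float)
    density = density / density.sum()
    sq_dev = (np.asarray(p_observed) - np.asarray(p_model)) ** 2
    return float(np.sqrt(np.sum(density * sq_dev)))

theta = np.linspace(-4, 4, 81)
p_model = 1 / (1 + np.exp(-(theta - 1.0)))                   # model curve for a hard item (b = 1)
p_obs = np.clip(p_model + 0.15 * (theta > 1.5), 0.0, 1.0)    # misfit concentrated at high theta
low_density = np.exp(-0.5 * (theta + 1.0) ** 2)              # low-performing country, mean -1
high_density = np.exp(-0.5 * (theta - 1.0) ** 2)             # country centered near the item's difficulty
print(rmsd_item_fit(p_obs, p_model, low_density))    # ~0.01: the misfit is largely invisible
print(rmsd_item_fit(p_obs, p_model, high_density))   # ~0.08: same misfit, clearly flagged
```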
-
Automated Test Assembly with Mixed‐Integer Programming: The Effects of Modeling Approaches and Solvers Journal of Educational Measurement (IF 1.056) Pub Date : 2019-11-19 Xiao Luo
Automated test assembly (ATA) is a modern approach to test assembly that applies advanced optimization algorithms on computers to build test forms automatically. ATA greatly improves the efficiency and accuracy of test assembly. This study investigated the effects of the modeling methods and solvers in the mixed‐integer programming (MIP) approach to ATA in the context of assembling parallel linear
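To make the MIP formulation concrete, here is a minimal sketch in Python with the PuLP modeling library: binary decision variables select items so as to maximize information at a target ability point, subject to a length constraint and a content constraint. The pool, information values, and constraints are hypothetical, and the study's comparisons of modeling methods and solvers are not reproduced here.

```python
import numpy as np
from pulp import LpProblem, LpVariable, LpMaximize, LpBinary, lpSum

rng = np.random.default_rng(1)
n_items, test_length = 200, 40
info = rng.uniform(0.1, 1.0, size=n_items).tolist()     # item information at a target theta
is_algebra = (rng.random(n_items) < 0.3).tolist()        # a hypothetical content attribute

prob = LpProblem("linear_form_assembly", LpMaximize)
x = [LpVariable(f"x_{i}", cat=LpBinary) for i in range(n_items)]

prob += lpSum(info[i] * x[i] for i in range(n_items))                # maximize test information
prob += lpSum(x) == test_length                                       # fixed test length
prob += lpSum(x[i] for i in range(n_items) if is_algebra[i]) >= 10    # content coverage

prob.solve()                                   # default CBC solver; commercial solvers can be plugged in
selected = [i for i in range(n_items) if x[i].value() > 0.5]
```

Swapping the solver or reformulating the constraints changes solution time and solution quality, which is the kind of trade-off the study investigates.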
-
Linking via Pseudo‐Equivalent Group Design: Methodological Considerations and an Application to the PISA and PIAAC Assessments Journal of Educational Measurement (IF 1.056) Pub Date : 2019-11-14 Artur Pokropek, Francesca Borgonovi
This article presents the pseudo‐equivalent group approach and discusses how it can enhance the quality of linking in the presence of nonequivalent groups. The pseudo‐equivalent group approach achieves pseudo‐equivalence using propensity score reweighting techniques. We use it to perform linking to establish scale concordance between two assessments. The article presents Monte Carlo simulations
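The reweighting step itself is standard propensity-score methodology; below is a minimal sketch (not the authors' full linking procedure) that weights one sample so its covariate distribution resembles the other's, using a logistic-regression propensity model. The variable names and the choice of scikit-learn are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_equivalence_weights(X_a, X_b):
    """Weights for sample A so that its covariate distribution mimics sample B's.
    X_a, X_b: covariate matrices (rows = respondents) for the two assessments."""
    X = np.vstack([X_a, X_b])
    z = np.concatenate([np.zeros(len(X_a)), np.ones(len(X_b))])   # 1 = member of sample B
    p = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X_a)[:, 1]
    p = np.clip(p, 1e-6, 1 - 1e-6)
    w = p / (1.0 - p)               # odds of belonging to B given covariates
    return w / w.mean()             # normalize to mean 1
```

With such weights in hand, weighted score distributions from the two samples can be treated as coming from pseudo-equivalent groups for linking purposes.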
-
A More Flexible Bayesian Multilevel Bifactor Item Response Theory Model Journal of Educational Measurement (IF 1.056) Pub Date : 2019-10-22 Ken A. Fujimoto
Multilevel bifactor item response theory (IRT) models are commonly used to account for features of the data that are related to the sampling and measurement processes used to gather those data. These models conventionally make assumptions about the portions of the data structure that represent these features. Unfortunately, when data violate these models' assumptions but these models are used anyway
-
Using Weighted Sum Scores to Close the Gap Between DIF Practice and Theory Journal of Educational Measurement (IF 1.056) Pub Date : 2019-10-22 Hongwen Guo, Neil J. Dorans
We make a distinction between the operational practice of using an observed score to assess differential item functioning (DIF) and the concept of departure from measurement invariance (DMI) that conditions on a latent variable. DMI and DIF indices of effect sizes, based on the Mantel‐Haenszel test of common odds ratio, converge under restricted conditions if a simple sum score is used as the matching
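For reference, the Mantel-Haenszel quantities mentioned in the abstract are the common odds ratio pooled over strata of the matching variable (the simple sum score, or a weighted sum score) and its conventional delta transform. The notation below is the standard one, with A_k and B_k the correct and incorrect counts in the reference group, C_k and D_k those in the focal group, and N_k the size of stratum k.

```latex
\hat{\alpha}_{MH} \;=\; \frac{\sum_{k} A_{k} D_{k} / N_{k}}{\sum_{k} B_{k} C_{k} / N_{k}},
\qquad
\Delta_{MH} \;=\; -2.35 \,\ln \hat{\alpha}_{MH}
```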
-
A New Statistic for Selecting the Smoothing Parameter for Polynomial Loglinear Equating Under the Random Groups Design Journal of Educational Measurement (IF 1.056) Pub Date : 2019-10-15 Chunyan Liu, Michael J. Kolen
Smoothing is designed to yield smoother equating results that can reduce random equating error without introducing very much systematic error. The main objective of this study is to propose a new statistic and to compare its performance to the performance of the Akaike information criterion and likelihood ratio chi‐square difference statistics in selecting the smoothing parameter for polynomial loglinear
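As background for the comparison, polynomial loglinear presmoothing fits a log-linear (Poisson) model to the raw-score frequencies with polynomial terms up to a chosen degree, and a criterion such as AIC picks the degree. The sketch below illustrates AIC-based selection on toy frequencies with statsmodels; the new statistic proposed in the article is not reproduced here.

```python
import numpy as np
import statsmodels.api as sm

def fit_loglinear(freq, degree):
    """Fit a polynomial loglinear (Poisson) smoothing model of a given degree."""
    scores = np.arange(len(freq), dtype=float)
    s = (scores - scores.mean()) / scores.std()      # rescale so high powers stay well conditioned
    X = np.column_stack([s ** d for d in range(degree + 1)])
    return sm.GLM(freq, X, family=sm.families.Poisson()).fit()

freq = np.array([2, 5, 11, 20, 34, 48, 55, 49, 36, 21, 12, 6, 1])   # toy score frequencies
fits = {d: fit_loglinear(freq, d) for d in range(1, 7)}
best_degree = min(fits, key=lambda d: fits[d].aic)   # AIC-selected smoothing degree
smoothed = fits[best_degree].fittedvalues            # smoothed frequencies used for equating
```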
-
Partial Identification of Answer Reviewing Effects in Multiple‐Choice Exams Journal of Educational Measurement (IF 1.056) Pub Date : 2019-10-14 Yongnam Kim
Does reviewing previous answers during multiple‐choice exams help examinees increase their final score? This article formalizes the question using a rigorous causal framework, the potential outcomes framework. Viewing examinees’ reviewing status as a treatment and their final score as an outcome, the article first explains the challenges of identifying the causal effect of answer reviewing in regular
-
Improving Item‐Exposure Control in Adaptive Testing Journal of Educational Measurement (IF 1.056) Pub Date : 2019-09-30 Wim J. van der Linden, Seung W. Choi
One of the methods of controlling test security in adaptive testing is imposing random item‐ineligibility constraints on the selection of the items with probabilities automatically updated to maintain a predetermined upper bound on the exposure rates. Three major improvements of the method are presented. First, a few modifications to improve the initialization of the method and accelerate the impact
-
Estimating the Accuracy of Relative Growth Measures Using Empirical Data Journal of Educational Measurement (IF 1.056) Pub Date : 2019-09-29 Katherine E. Castellano, Daniel F. McCaffrey
The residual gain score has been of historical interest, and its percentile rank has been of interest more recently given its close correspondence to the popular Student Growth Percentile. However, these estimators suffer from low accuracy and systematic bias (bias conditional on prior latent achievement). This article explores three alternatives—using the expected a posteriori (EAP), conditioning on
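The baseline estimators discussed here are easy to state: regress the current score on the prior score, take each student's residual as the gain measure, and report its percentile rank. The sketch below computes both from observed scores; the EAP-based and covariate-conditioned alternatives explored in the article are not shown.

```python
import numpy as np
from scipy import stats

def residual_gain_percentiles(prior, current):
    """Residual gain scores and their percentile ranks (an observed-score
    analogue of the Student Growth Percentile)."""
    prior = np.asarray(prior, dtype=float)
    current = np.asarray(current, dtype=float)
    slope, intercept, *_ = stats.linregress(prior, current)
    residual_gain = current - (intercept + slope * prior)
    pct_rank = 100.0 * stats.rankdata(residual_gain) / len(residual_gain)
    return residual_gain, pct_rank
```

Because both quantities are functions of error-prone observed scores, they inherit the low accuracy and conditional bias that motivate the alternatives studied.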
-
Comparing the Accuracy of Student Growth Measures Journal of Educational Measurement (IF 1.056) Pub Date : 2019-09-29 Katherine E. Castellano, Daniel F. McCaffrey
Testing programs are often interested in using a student growth measure. This article presents analytic derivations of the accuracy of common student growth measures on both the raw scale of the test and the percentile rank scale in terms of the proportional reduction in mean squared error and the squared correlation between the estimator and target. The study contrasts the accuracy of the growth measures
-
IRT Approaches to Modeling Scores on Mixed‐Format Tests Journal of Educational Measurement (IF 1.056) Pub Date : 2019-09-12 Won‐Chan Lee, Stella Y. Kim, Jiwon Choi, Yujin Kang
This article considers psychometric properties of composite raw scores and transformed scale scores on mixed‐format tests that consist of a mixture of multiple‐choice and free‐response items. Test scores on several mixed‐format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response
-
Item Selection and Exposure Control Methods for Computerized Adaptive Testing with Multidimensional Ranking Items Journal of Educational Measurement (IF 1.056) Pub Date : 2019-09-12 Chia‐Wen Chen, Wen‐Chung Wang, Ming Ming Chiu, Sage Ro
The use of computerized adaptive testing algorithms for ranking items (e.g., college preferences, career choices) involves two major challenges: unacceptably high computation times (selecting from a large item pool with many dimensions) and biased results (enhanced preferences or intensified examinee responses because of repeated statements across items). To address these issues, we introduce subpool
-
Exploring How to Model Formative Assessment Trajectories of Posing‐Pausing‐Probing Practices: Toward a Teacher Learning Progressions Framework for the Study of Novice Teachers Journal of Educational Measurement (IF 1.056) Pub Date : 2019-09-12 Brent Duckor, Carrie Holmberg
A robust body of evidence supports the finding that particular teaching and assessment strategies in the K‐12 classroom can improve student achievement. While experts have identified many effective teaching and learning practices in the assessment for learning literature, teachers’ knowledge and use of “high leverage” formative assessment (FA) practices are difficult to model in novice populations
-
A New Statistic to Assess Fitness of Cubic‐Spline Postsmoothing Journal of Educational Measurement (IF 1.056) Pub Date : 2019-09-11 Hyung Jin Kim, Robert L. Brennan, Won‐Chan Lee
In equating, smoothing techniques are frequently used to diminish sampling error. There are typically two types of smoothing: presmoothing and postsmoothing. For polynomial log‐linear presmoothing, an optimum smoothing degree can be determined statistically based on the Akaike information criterion or Chi‐square difference criterion. For cubic‐spline postsmoothing, visual inspection has been an important
-
Do Teachers Consider Advice? On the Acceptance of Computerized Expert Models Journal of Educational Measurement (IF 1.056) Pub Date : 2019-09-08 Esther Kaufmann, David V. Budescu
The literature suggests that simple expert (mathematical) models can improve the quality of decisions, but people are not always eager to accept and endorse such models. We ran three online experiments to test the receptiveness to advice from computerized expert models. Middle‐ and high‐school teachers (N = 435) evaluated student profiles that varied in several personal and task relevant factors. They
-
Integrating Multiple Sources of Validity Evidence for an Assessment‐Based Cognitive Model Journal of Educational Measurement (IF 1.056) Pub Date : 2019-09-08 Thomas Langenfeld, Jay Thomas, Rongchun Zhu, Carrie A. Morris
An assessment of graphic literacy was developed by articulating and subsequently validating a skills‐based cognitive model intended to substantiate the plausibility of score interpretations. Model validation involved use of multiple sources of evidence derived from large‐scale field testing and cognitive lab studies. Data from large‐scale field testing were evaluated using traditional psychometric
-
Logistic Regression Procedure Using Penalized Maximum Likelihood Estimation for Differential Item Functioning Journal of Educational Measurement (IF 1.056) Pub Date : 2019-09-08 Sunbok Lee
In the logistic regression (LR) procedure for differential item functioning (DIF), the parameters of LR have often been estimated using maximum likelihood (ML) estimation. However, ML estimation suffers from the finite‐sample bias. Furthermore, ML estimation for LR can be substantially biased in the presence of rare event data. The bias of ML estimation due to small samples and rare event data can
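For orientation, the standard LR DIF screen for a single item is sketched below: a likelihood-ratio test compares a model with group and group-by-score terms against a score-only model. Plain maximum likelihood is used in this sketch; the article's point is that a penalized (Firth-type) likelihood would replace these ML fits when samples are small or nearly all examinees answer the item the same way.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def lr_dif_test(item_correct, total_score, group):
    """Joint likelihood-ratio test for uniform and nonuniform DIF on one item."""
    y = np.asarray(item_correct, dtype=float)
    s = np.asarray(total_score, dtype=float)
    g = np.asarray(group, dtype=float)                 # 0 = reference, 1 = focal
    X_null = sm.add_constant(s)                        # matching variable only
    X_full = sm.add_constant(np.column_stack([s, g, s * g]))
    ll_null = sm.Logit(y, X_null).fit(disp=0).llf
    ll_full = sm.Logit(y, X_full).fit(disp=0).llf
    lr_stat = 2.0 * (ll_full - ll_null)
    return lr_stat, chi2.sf(lr_stat, df=2)             # 2 df: group main effect + interaction
```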
-
Examining the Precision of Cut Scores Within a Generalizability Theory Framework: A Closer Look at the Item Effect Journal of Educational Measurement (IF 1.056) Pub Date : 2019-09-08 Brian E. Clauser, Michael Kane, Jerome C. Clauser
An Angoff standard setting study generally yields judgments on a number of items by a number of judges (who may or may not be nested in panels). Variability associated with judges (and possibly panels) contributes error to the resulting cut score. The variability associated with items plays a more complicated role. To the extent that the mean item judgments directly reflect empirical item difficulties
-
Classification Consistency and Accuracy With Atypical Score Distributions Journal of Educational Measurement (IF 1.056) Pub Date : 2019-09-05 Stella Y. Kim, Won‐Chan Lee
The current study aims to evaluate the performance of three non‐IRT procedures (i.e., normal approximation, Livingston‐Lewis, and compound multinomial) for estimating classification indices when the observed score distribution shows atypical patterns: (a) bimodality, (b) structural (i.e., systematic) bumpiness, or (c) structural zeros (i.e., no frequencies). Under a bimodal distribution, the normal
-
A Comparison of Aggregation Rules for Selecting Anchor Items in Multigroup DIF Analysis Journal of Educational Measurement (IF 1.056) Pub Date : 2019-09-05 Thorben Huelmann, Rudolf Debelak, Carolin Strobl
This study addresses the topic of how anchoring methods for differential item functioning (DIF) analysis can be used in multigroup scenarios. The direct approach would be to combine anchoring methods developed for two‐group scenarios with multigroup DIF‐detection methods. Alternatively, multiple tests could be carried out. The results of these tests need to be aggregated to determine the anchor for
-
Item Calibration Methods With Multiple Subscale Multistage Testing Journal of Educational Measurement (IF 1.056) Pub Date : 2019-09-04 Chun Wang, Ping Chen, Shengyu Jiang
Many large‐scale educational surveys have moved from linear form design to multistage testing (MST) design. One advantage of MST is that it can provide more accurate latent trait (θ) estimates using fewer items than required by linear tests. However, MST generates incomplete response data by design; hence, questions remain as to how to calibrate items using the incomplete data from MST design. Further
-
Bayesian Extension of Biweight and Huber Weight for Robust Ability Estimation Journal of Educational Measurement (IF 1.056) Pub Date : 2019-09-03 Hotaka Maeda, Bo Zhang
When a response pattern does not fit a selected measurement model, one may resort to robust ability estimation. Two popular robust methods are biweight and Huber weight. So far, research on these methods has been quite limited. This article proposes the maximum a posteriori biweight (BMAP) and Huber weight (HMAP) estimation methods. These methods use the Bayesian prior distribution to compensate for
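The two weighting schemes named in the abstract are standard robust-statistics weight functions applied to standardized person-item residuals r, with tuning constants k and c; the proposed BMAP and HMAP estimators add a Bayesian prior on ability on top of this weighting. The definitions below are the conventional ones, stated for reference.

```latex
w_{\mathrm{Huber}}(r) \;=\;
\begin{cases}
  1, & |r| \le k \\[4pt]
  k / |r|, & |r| > k
\end{cases}
\qquad
w_{\mathrm{biweight}}(r) \;=\;
\begin{cases}
  \bigl(1 - (r/c)^{2}\bigr)^{2}, & |r| \le c \\[4pt]
  0, & |r| > c
\end{cases}
```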
-
Assessing and Validating Effects of a Data‐Based Decision‐Making Intervention on Student Growth for Mathematics and Spelling Journal of Educational Measurement (IF 1.056) Pub Date : 2019-09-02 Trynke Keuning, Marieke van Geel, Adrie Visscher, Jean‐Paul Fox
Data‐based decision making (DBDM) is presumed to improve student performance in elementary schools in all subjects. The majority of studies in which DBDM effects have been evaluated have focused on mathematics. A hierarchical multiple single‐subject design was used to measure effects of a 2‐year training, in which entire school teams learned how to implement and sustain DBDM, in 39 elementary schools
-
Examining the Dual Purpose Use of Student Learning Objectives for Classroom Assessment and Teacher Evaluation Journal of Educational Measurement (IF 1.056) Pub Date : 2019-08-29 Derek C. Briggs, Rajendra Chattergoon, Amy Burkhardt
The process of setting and evaluating student learning objectives (SLOs) has become increasingly popular as an example where classroom assessment is intended to fulfill the dual purpose use of informing instruction and holding teachers accountable. A concern is that the high‐stakes purpose may lead to distortions in the inferences about students and teachers that SLOs can support. This concern is explored
-
Can We Learn From Student Mistakes in a Formative, Reading Comprehension Assessment? Journal of Educational Measurement (IF 1.056) Pub Date : 2019-08-29 Bowen Liu, Patrick C. Kennedy, Ben Seipel, Sarah E. Carlson, Gina Biancarosa, Mark L. Davison
This article describes an ongoing project to develop a formative, inferential reading comprehension assessment of causal story comprehension. It has three features to enhance classroom use: equated scale scores for progress monitoring within and across grades, a scale score to distinguish among low‐scoring students based on patterns of mistakes, and a reading efficiency index. Instead of two response
-
Classroom Assessment and Large‐Scale Psychometrics: Shall the Twain Meet? (A Conversation With Margaret Heritage and Neal Kingston) Journal of Educational Measurement (IF 1.056) Pub Date : 2019-08-29 Margaret Heritage, Neal M. Kingston
Classroom assessment and large‐scale assessment have, for the most part, existed in mutual isolation. Some experts have felt this is for the best and others have been concerned that the schism limits the potential contribution of both forms of assessment. Margaret Heritage has long been a champion of best practices in classroom assessment. Neal Kingston has been involved with the application of psychometrics
-
Students’ Interpretation of Formative Assessment Feedback: Three Claims for Why We Know So Little About Something So Important Journal of Educational Measurement (IF 1.056) Pub Date : 2019-08-23 Jacqueline P. Leighton
If K‐12 students are to be fully integrated as active participants in their own learning, understanding how they interpret formative assessment feedback is needed. The objective of this article is to advance three claims about why teachers and assessment scholars/specialists may have little understanding of students’ interpretation of formative assessment feedback. The three claims are as follows.
-
Examining Psychometric Properties and Level Classification of the van Hiele Geometry Test Using CTT and CDM Frameworks Journal of Educational Measurement (IF 1.056) Pub Date : 2019-08-19 Yi‐Hsin Chen, Sharon L. Senk, Denisse R. Thompson, Kevin Voogt
The van Hiele theory and van Hiele Geometry Test have been extensively used in mathematics assessments across countries. The purpose of this study is to use classical test theory (CTT) and cognitive diagnostic modeling (CDM) frameworks to examine psychometric properties of the van Hiele Geometry Test and to compare how various classification criteria assign van Hiele levels to students. The findings
-
A General Framework for the Validation of Embedded Formative Assessment Journal of Educational Measurement (IF 1.056) Pub Date : 2019-08-19 Dorien Hopster‐den Otter, Saskia Wools, Theo J. H. M. Eggen, Bernard P. Veldkamp
In educational practice, test results are used for several purposes. However, validity research is especially focused on the validity of summative assessment. This article aimed to provide a general framework for validating formative assessment. The authors applied the argument‐based approach to validation to the context of formative assessment. This resulted in a proposed interpretation and use argument
-
Scoring Stability in a Large‐Scale Assessment Program: A Longitudinal Analysis of Leniency/Severity Effects Journal of Educational Measurement (IF 1.056) Pub Date : 2019-08-05 Corey Palermo, Michael B. Bunch, Kirk Ridge
Although much attention has been given to rater effects in rater‐mediated assessment contexts, little research has examined the overall stability of leniency and severity effects over time. This study examined longitudinal scoring data collected during three consecutive administrations of a large‐scale, multi‐state summative assessment program. Multilevel models were used to assess the overall extent
-
Predicting Operational Rater‐Type Classifications Using Rasch Measurement Theory and Random Forests: A Music Performance Assessment Perspective Journal of Educational Measurement (IF 1.056) Pub Date : 2019-08-05 Brian C. Wesolowski
The purpose of this study was to build a Random Forest supervised machine learning model in order to predict musical rater‐type classifications based upon a Rasch analysis of raters’ differential severity/leniency related to item use. Raw scores (N = 1,704) from 142 raters across nine high school solo and ensemble festivals (grades 9–12) were collected using a 29‐item Likert‐type rating scale embedded
-
Two IRT Fixed Parameter Calibration Methods for the Bifactor Model Journal of Educational Measurement (IF 1.056) Pub Date : 2019-08-01 Kyung Yong Kim
New items are often evaluated prior to their operational use to obtain item response theory (IRT) item parameter estimates for quality control purposes. Fixed parameter calibration is one linking method that is widely used to estimate parameters for new items and place them on the desired scale. This article provides detailed descriptions of two fixed parameter calibration methods for the bifactor
-
Accounting for Rater Effects With the Hierarchical Rater Model Framework When Scoring Simple Structured Constructed Response Tests Journal of Educational Measurement (IF 1.056) Pub Date : 2019-07-28 Ricardo Nieto, Jodi M. Casabianca
Many large‐scale assessments are designed to yield two or more scores for an individual by administering multiple sections measuring different but related skills. Multidimensional tests, or more specifically simple structured tests such as these, rely on multiple sections of multiple‐choice and/or constructed‐response items to generate multiple scores. In the current article, we propose an extension
-
A Two‐Stage Method for Classroom Assessments of Essay Writing Journal of Educational Measurement (IF 1.056) Pub Date : 2019-07-18 Stephen Mark Humphry, Sandy Heldsinger
To capitalize on professional expertise in educational assessment, it is desirable to develop and test methods of rater‐mediated assessment that enable classroom teachers to make reliable and informative judgments. Accordingly, this article investigates the reliability of a two‐stage method used by classroom teachers to assess primary school students’ persuasive writing. Stage 1 involves pairwise comparisons
-
Modeling Rater Response Processes in Evaluating Score Meaning Journal of Educational Measurement (IF 1.056) Pub Date : 2019-07-16 Suzanne Lane
Rater‐mediated assessments require the evaluation of the accuracy and consistency of the inferences made by the raters to ensure the validity of score interpretations and uses. Modeling rater response processes allows for a better understanding of how raters map their representations of the examinee performance to their representation of the scoring criteria. Validity of score meaning is affected by
-
Conceptualizing Rater Judgments and Rating Processes for Rater‐Mediated Assessments Journal of Educational Measurement (IF 1.056) Pub Date : 2019-07-16 Jue Wang, George Engelhard
Rater‐mediated assessments exhibit scoring challenges due to the involvement of human raters. The quality of human ratings largely determines the reliability, validity, and fairness of the assessment process. Our research recommends that the evaluation of ratings should be based on two aspects: a theoretical model of human judgment and an appropriate measurement model for evaluating these judgments
-
Nonparametric Evidence of Validity, Reliability, and Fairness for Rater‐Mediated Assessments: An Illustration Using Mokken Scale Analysis Journal of Educational Measurement (IF 1.056) Pub Date : 2019-07-16 Stefanie A. Wind
Numerous researchers have proposed methods for evaluating the quality of rater‐mediated assessments using nonparametric methods (e.g., kappa coefficients) and parametric methods (e.g., the many‐facet Rasch model). Generally speaking, popular nonparametric methods for evaluating rating quality are not based on a particular measurement theory. On the other hand, popular parametric methods for evaluating
-
Pedagogical Considerations for Examining Rater Variability in Rater‐Mediated Assessments: A Three‐Model Framework Journal of Educational Measurement (IF 1.056) Pub Date : 2019-07-16 Brian C. Wesolowski, Stefanie A. Wind
Rater‐mediated assessments are a common methodology for measuring persons, investigating rater behavior, and/or defining latent constructs. The purpose of this article is to provide a pedagogical framework for examining rater variability in the context of rater‐mediated assessments using three distinct models. The first model is the observation model, which includes ecological/environmental considerations
-
Performance of Person‐Fit Statistics Under Model Misspecification Journal of Educational Measurement (IF 1.056) Pub Date : 2019-07-15 Seong Eun Hong, Scott Monroe, Carl F. Falk
In educational and psychological measurement, a person‐fit statistic (PFS) is designed to identify aberrant response patterns. For parametric PFSs, valid inference depends on several assumptions, one of which is that the item response theory (IRT) model is correctly specified. Previous studies have used empirical data sets to explore the effects of model misspecification on PFSs. We further this line
-
Use of Adjustment by Minimum Discriminant Information in Linking Constructed‐Response Test Scores in the Absence of Common Items Journal of Educational Measurement (IF 1.056) Pub Date : 2019-06-03 Yi‐Hsuan Lee, Shelby J. Haberman, Neil J. Dorans
In many educational tests, both multiple‐choice (MC) and constructed‐response (CR) sections are used to measure different constructs. In many common cases, security concerns lead to the use of form‐specific CR items that cannot be used for equating test scores, along with MC sections that can be linked to previous test forms via common items. In such cases, adjustment by minimum discriminant information
-
Scale Alignment in Between‐Item Multidimensional Rasch Models Journal of Educational Measurement (IF 1.056) Pub Date : 2019-06-03 Leah Feuerstahler, Mark Wilson
Scores estimated from multidimensional item response theory (IRT) models are not necessarily comparable across dimensions. In this article, the concept of aligned dimensions is formalized in the context of Rasch models, and two methods are described—delta dimensional alignment (DDA) and logistic regression alignment (LRA)—to transform estimated item parameters so that dimensions are aligned. Both the
-
Item Response Models for Multiple Attempts With Incomplete Data Journal of Educational Measurement (IF 1.056) Pub Date : 2019-06-03 Yoav Bergner, Ikkyu Choi, Katherine E. Castellano
Allowance for multiple chances to answer constructed response questions is a prevalent feature in computer‐based homework and exams. We consider the use of item response theory in the estimation of item characteristics and student ability when multiple attempts are allowed but no explicit penalty is deducted for extra tries. This is common practice in online formative assessments, where the number
-
Comparing Academic Readiness Requirements for Different Postsecondary Pathways: What Admissions Tests Tell Us Journal of Educational Measurement (IF 1.056) Pub Date : 2019-06-03 Jeffrey T. Steedle, Justine Radunzel, Krista D. Mattern
Ensuring postsecondary readiness is a goal of K‐12 education, but it is unclear whether high school students should get different messages about the required levels of academic preparation depending on their postsecondary trajectories. This study estimated readiness benchmark scores on a college admissions test predictive of earning good grades in majors associated with middle‐skills occupations at
Contents have been reproduced by permission of the publishers.