Abstract
Existing approaches to measuring writing performance are insufficient in terms of both technical adequacy and feasibility for use as screening measures. This study examined the validity and diagnostic accuracy of several automated text evaluation approaches and written expression curriculum-based measurement (WE-CBM) to determine whether an automated approach improves technical adequacy. A sample of 140 fourth-grade students generated writing samples that were scored using traditional and automated approaches and examined in relation to the statewide measure of writing performance. Results indicated that validity and diagnostic accuracy were comparable for the best-performing WE-CBM metric, correct minus incorrect word sequences, and the automated scoring approaches, with the automated approaches offering potentially improved feasibility for screening. Across all scoring approaches, however, averaging scores from three time points was necessary to achieve improved validity and adequate diagnostic accuracy. Limitations, implications, and directions for future research on the use of automated scoring approaches for screening are discussed.
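The abstract's central comparison, screening with a single probe versus an average of scores across three time points, can be illustrated with a minimal R sketch of a diagnostic accuracy analysis. This is not the authors' code: the data, cut points, and the use of the pROC package are hypothetical assumptions for illustration only.

# Minimal sketch (hypothetical data, not the study's analysis): comparing the
# diagnostic accuracy of one writing probe versus a three-probe average for
# predicting a binary state-test outcome.
library(pROC)  # provides roc() and auc() for receiver operating characteristic analysis

set.seed(1)
n <- 140                                              # sample size matching the abstract
ciws <- replicate(3, rnorm(n, mean = 40, sd = 12))    # hypothetical scores from three occasions
state_fail <- rbinom(n, 1, plogis(-(rowMeans(ciws) - 40) / 12))  # 1 = below standard (simulated)

# Screening score from a single occasion versus the average of three occasions
roc_single <- roc(state_fail, ciws[, 1], quiet = TRUE)
roc_mean   <- roc(state_fail, rowMeans(ciws), quiet = TRUE)

auc(roc_single)  # area under the ROC curve for one probe
auc(roc_mean)    # averaging across occasions reduces occasion-specific error

In data of this kind, the three-probe average would typically show a higher area under the curve than a single probe, which is the pattern of improved diagnostic accuracy the abstract reports.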
Acknowledgements
This research was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A190100 awarded to the University of Houston (PI: Milena Keller-Margulis). The opinions expressed are those of the authors and do not represent the views of the Institute or the U.S. Department of Education.
Cite this article
Keller-Margulis, M.A., Mercer, S.H. & Matta, M. Validity of automated text evaluation tools for written-expression curriculum-based measurement: a comparison study. Read Writ 34, 2461–2480 (2021). https://doi.org/10.1007/s11145-021-10153-6