Introduction

Revealing and Assessing Student Science Practice Knowledge

Recent discussions of K-16 STEM education have focused on the disciplinary knowledge and practices students should develop and use in novel situations (e.g., Seraphin et al., 2013; Wild et al., 2018; Windschitl et al., 2007). Within science, these include conducting experiments, understanding and using models, and organizing, analyzing, and interpreting data (e.g., Kjelvik and Schultheis, 2019; Kuhn, 2010; Lehrer, Schauble, and Lucas, 2008; Lehrer, Schauble, and Petrosino, 2001). The move from analog to digital work environments has been rapid, but research on the implications of that move has lagged behind technology advancement and adoption. Some research on learning by environment exists. For instance, Cromley et al. (2020) recently published a meta-analysis showing that drawing-to-learn tends to improve learning outcomes only when students draw by hand, not when they use digital drawing tools, and other studies have similarly found differences suggesting that the environment in which students work can affect learning (e.g., Mueller and Oppenheimer, 2014; Sinclair and Yurita, 2008). In the specific context of graphing, it has been argued that the benefits of computer tools in graph generation (i.e., a high degree of accuracy and ease in generating alternative displays for examination) may unintentionally lead students to be less intentional and reflective in their graphing decisions, such as selecting appropriate variables (Tairab and Khalaf Al-Naqbi, 2004). It has been recommended that students approach graph construction in phases: exploring the data and planning how best to represent them based on the task or purpose for graphing, sketching graphs out by hand, and finally using technological tools for data visualization and meaning making (Patterson and Leonard, 2005; Tairab and Khalaf Al-Naqbi, 2004). These studies have implications for instruction, but much less research examines the impact of environments on assessment features and design.

Assessment is critical for improving STEM teaching (Harwell et al., 2015), but assessing science practices, such as conducting data analysis, is difficult to do with closed-form questions like multiple choice (e.g., Garfield, 2003; Zieffler et al., 2008). Instead, those practices are often assessed through open-ended activities such as writing, drawing, or presentations. Such open-ended, manual assessments are difficult to conduct at larger scales, such as with the hundreds of students found in many introductory college science classes.

Digital tools are widely used for large-scale assessments of students’ procedural knowledge and application of STEM learning (e.g., recognizing and defining terminology, applying knowledge to solve well-defined problems). With notable exceptions (e.g., in biology, Urban-Lurain et al., 2013; Beggrow et al., 2014; Weston et al., 2015; Vitale et al., 2015, 2019; Zhai et al., 2020), there are fewer examples of digital tools designed to automatically measure practices in complex, less well-defined problem spaces, in large part because of the difficulty of valid automatic scoring (e.g., Beggrow et al., 2014; Ha and Nehm, 2016). Furthermore, few studies compare assessment in pen-and-paper versus digital formats, and those that do reach differing conclusions about student performance across the two environments (e.g., Kumar et al., 1994; Aberg-Bengtsson, 2006; Guimaraes et al., 2018; Oqvist and Nouri, 2018). As development of digital assessment tools increases, the field would benefit from more research on how differences between paper and digital environments interact with the way students demonstrate understanding through performance tasks.

A key difference of digital assessment tools is the suite of affordances and constraints not present in the blank canvas of a piece of paper and a pen (or generic writing and drawing software). Constraints are almost inevitable in a digital assessment tool when auto-scoring is desired, as the tool will only have a limited set of functionality, and a certain degree of constraint on student actions increases the accuracy and precision of scoring algorithms (Scalise and Gifford, 2006; Kim et al., 2017; Meir et al., 2019). However, those constraints can also affect the learning to be measured by the assessment (Meir et al., 2019; Cromley et al., 2020). Constraints can preclude measurement of certain aspects of higher-order thinking, but can also help reveal aspects of thinking that would be hidden by other confusions in a more lightly constrained pen-and-paper assessment. For instance, auto-scored question formats that are intermediately open, between multiple choice and short essays, can help students express their thinking more precisely than in either multiple choice or essay format (Meir et al., 2019). Similarly, the affordances of a digital tool may change the graphing practices students demonstrate due to different or novel aspects of the tool (e.g., Sinclair and Yurita, 2008). Thus, it is important to understand how the affordances and constraints of any digital tool interact with student answers when judging what may be learned from those answers.

In this study, we explore undergraduate biology students’ graphing practices in two different environments: pen-and-paper (PP) and a digital assessment tool, GraphSmarts (GS).

Graph Construction Consists of Many Graphing Practices and Decision Points

Undergraduate science students are required to interpret and construct graphs (AAAS, 2009; Kjelvik and Schultheis, 2019; Shanahan et al., 2011), yet studies show that students at all levels, from K-12 to post-graduate, struggle with graphing (e.g., Angra and Gardner, 2017; Chick, 2004; D’Ambrosio et al., 2004; Roth and Hwang, 2006; Tufte, 1983). By the postsecondary level, students know the mechanics of graphing (e.g., placing data points in the Cartesian coordinate system; Padilla et al., 1986) but still struggle with higher-order skills such as relating variables and graph type to the goal of the graphing task. These continued struggles may be evidence that graphing is a practice that must be learned through experience rather than as a cognitive skill (Roth and Bowen, 2001; Bowen and Roth, 2005; Roth and McGinn, 1997). In this view, “poor” graphing reflects a lack of graphing experience rather than a lack of cognitive ability, so that “graph sense develops gradually as a result of one’s creating graphs and using already designed graphs in a variety of problem contexts that require making sense of data” (Friel et al., 2001). These practices lead to broadly agreed-upon principles of what makes a “good” graph (Kosslyn, 1994; Tufte, 1983). We build from a body of research on the ways in which the graphs of both developing novices and experienced graphers differ from the established norms of good graphs (e.g., Angra and Gardner, 2016, 2017; Bowen et al., 1999; Diong et al., 2018; Weissgerber et al., 2015, 2019) and focus here on five central graphing practices that would enhance graphs (Table 1 and described below).

Table 1 Description of graphing practice concepts, and their application, evaluated in the current study

Variable Relevance. When designing an inquiry or experiment, one must decide which variables are relevant to the inquiry (Cobb and Moore, 1997; Mayes et al., 2014; Wild and Pfannkuch, 1999). The measures of those relevant variables (i.e., the data) are then analyzed, for example, by constructing graphs.

Form of the Data. Descriptive statistics are helpful for searching for signals and evidence of patterns in a data set (Konold and Pollatsek, 2002). By Form of the Data plotted we mean whether students plotted the data as collected or given (i.e., every single data point) versus summarized data (e.g., the mean or median of the data set). Often summarized data such as means are plotted to compare data from different treatments. Yet researchers in different fields (e.g., Diong et al., 2018; Friel et al., 2006; Shaughnessy, 2006; Weissgerber et al., 2015, 2019) are calling for plotting raw data as a better illustration of data distribution.

Acknowledging Variability. In exploratory data analysis one must acknowledge and account for variability (Cobb and Moore, 1997; Moore, 1997; Shaughnessy, 2006; Watson et al., 2003; Wild and Pfannkuch, 1999). This practice is especially important when plotting summarized data such as means, as means can hide important information about the data distribution (Watson and Moritz, 1998; Friel et al., 2006; Shaughnessy, 2006).

Graph Type. Selecting a graph type is a subjective practice involving choosing “a type of graph from the point of view of being able to characterize shape (of the distribution) and then relating the shape to the context being investigated” (Friel et al., 2006, p. 126). That choice may change as inquirers familiarize themselves with the main characteristics of a data set (Friel et al., 2001; Gelman and Unwin, 2013; Kosslyn, 1985; Tukey, 1977; Wild and Pfannkuch, 1999), but there are some well-accepted recommendations. For example, scatter graphs are generally good for exploring the distribution of a data set and identifying its outliers, spread, and clusters (Friel et al., 2001, 2006; Weissgerber et al., 2015). Bar graphs are good for communicating results of a treatment or experiment, while line graphs are good for describing and exploring relationships among quantitative data points where one could reasonably interpolate between points (Friel et al., 2001, 2006; Kosslyn, 1985, 1994).

Graph Communication. Once the inquirer has explored the data set with all its characteristics and wants to communicate it to an audience, they need to select a graph type and form of the data that help others readily see the patterns and trends they want to highlight (Friel et al., 2001; Kosslyn, 1985, 1994). This often makes graphing an iterative and subjective process in which the inquirer tries many graphs until they identify one that best communicates the data in that context (Friel et al., 2006); Graph Communication is the ability to evaluate the whole graph in this more holistic sense.

Research Questions

Given the background above and the paucity of studies on the effect of environment on assessment, this study set out to answer the following two research questions:

  • What are the similarities and differences in undergraduate students’ graphing practices in relation to the graphing environments in which they worked?

  • How do the constraints and affordances of the two environments influence undergraduate students' graphing practices?

Context for the Study

The Digital Graphing Tool

The digital graphing tool GraphSmarts (GS), whose properties we explore here, is being developed as part of a broader project whose goals are to build performance-based assessments of graphing practices. Comparing GS with pen-and-paper (PP) graphing was originally conceived as a form of validation of GS, but the data have proven interesting in their own right.

GS and its evolution are more fully described elsewhere (manuscript in prep), but some of its attributes are important to the comparison with a PP graphing environment. GS had a left-hand panel with six buttons that gave users access to all the features they could add to the graph, as well as some graphing process items such as being able to review the research question (Fig. 1). On the right was the graph they were constructing. As users selected features on the left, the graph on the right reflected those features. We used features of the graphs produced by participants as evidence of their skill applying the concepts related to graph construction practice (Friel et al., 2001; Roth and McGinn, 1997).

Fig. 1
figure 1

Screenshot of the digital graphing tool

The GS design was guided by the evidence-centered design framework (Mislevy, 2013), with a major design goal of algorithmically scoring students’ performance. We thus included constraints and affordances that we hypothesized would better enable us to accurately capture student graphing practice. For instance, we constrained users to three commonly used graph types: bar, line, and scatter. By limiting the types of graph available, we made it easier to algorithmically determine whether the users’ choice of graph type matched well with the form of the data they were attempting to plot. As another example, knowing that most undergraduate students are already competent at plotting data points (e.g., Brasell and Rowe, 1993), we afforded users increased speed and decreased tedium by plotting the points, bars, and/or lines for them, freeing up time to assess other practices more thoroughly.
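To illustrate how limiting the available graph types simplifies algorithmic scoring, the sketch below (in R, the language we used for our statistical analyses) shows one way such a rule check could be encoded. The function name and rules are ours, invented for exposition; they are not the actual GraphSmarts scoring code.

```r
## Illustrative sketch only: these rules and names are a simplification for
## exposition, not the actual GraphSmarts scoring code.
score_graph_type <- function(graph_type, x_is_categorical) {
  # With only bar, line, and scatter available, a small rule table suffices
  # to judge whether the chosen type suits the variable on the x-axis.
  if (x_is_categorical && graph_type %in% c("bar", "scatter")) return("appropriate")
  if (!x_is_categorical && graph_type %in% c("scatter", "line")) return("appropriate")
  "questionable"
}

score_graph_type("bar",  x_is_categorical = TRUE)   # "appropriate"
score_graph_type("line", x_is_categorical = TRUE)   # "questionable"
```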

The digital tool was embedded in several pages of text and images that introduced the context, research question and data, as well as some multiple choice, intermediate constraint (Meir et al., 2019), and essay questions before and after the graphing task. We do not discuss data from the other questions here. We used the SimUText system from SimBio to serve the tool to students and collect the resulting data.

Biological Context of the Task

Calls for teaching of graphing as part of data analysis suggest using data sets embedded in contexts that represent authentic explorations (Friel et al., 2001; Kjelvik and Schultheis, 2019; Moore, 1997; Weiland, 2017). The problem presented to students involves Marine Protected Areas (MPAs) that have been set aside off the coast of Tasmania to protect marine communities from fishing. One of the marine communities is a well-studied food chain where lobsters (which are fished for human consumption) eat sea urchins, which in turn eat kelp. Much of the marine life in these communities is dependent on “forests” of mature, tall kelp. Too many sea urchins can cut down the kelp in an area, so the research question presented to users is “Do Marine Protected Areas (MPAs) in Tasmania succeed in promoting healthy kelp forests?” From this general question, users are asked to examine a specific hypothesis that “Eliminating lobster fishing will result in improved kelp forest health, due to food chain dynamics.” They are asked to test this hypothesis by testing a resulting prediction, that “Areas with no lobster fishing (MPAs) have fewer urchins than do areas with lobster fishing.”

Users are given a data set with eighteen samples taken from eighteen different locations. Each sample contains three quantitative variables (lobster density, lobster size, urchin density) and four categorical variables (MPA status, kelp abundance, plot ID, time of sample). Users first see this data set in table form (Fig. 2). Of these data, time of sample and plot ID are not relevant to the research question. The rest of the variables are relevant to the hypothesis, but only two are relevant to the prediction the user is asked to test (MPA and urchin density). For this comparison, we decided to focus on criteria related to five practices that we could measure regardless of whether users constructed graphs against the prediction or against the more general hypothesis, including variable relevance to the hypothesis (Table 1).

Fig. 2
figure 2

Variables and a subset of the values given for the graphing task
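For readers who prefer a concrete representation, the following R sketch reconstructs the structure of the data set described above. The column names mirror the variables shown to participants in Fig. 2, but the values (and the time-of-sample levels) are invented placeholders, not the data participants saw.

```r
## Toy reconstruction of the data set's structure. Values are invented
## placeholders; the actual values are those shown to participants in Fig. 2.
set.seed(1)
kelp_data <- data.frame(
  plot_id         = factor(1:18),                               # categorical; not relevant
  time_of_sample  = factor(rep(c("spring", "fall"), 9)),        # categorical; not relevant
  mpa_status      = factor(rep(c("MPA", "non-MPA"), each = 9)), # categorical; relevant to prediction
  kelp_abundance  = factor(sample(c("low", "medium", "high"), 18, replace = TRUE)),
  lobster_density = round(runif(18, 0, 30), 1),                 # quantitative
  lobster_size    = round(runif(18, 60, 140), 1),               # quantitative
  urchin_density  = round(runif(18, 0, 50), 1)                  # quantitative; relevant to prediction
)
str(kelp_data)
```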

Materials and Methods

Participants

All work with human subjects was done in accordance with an approved human studies protocol (IRB #1706019374). Participants in this study were 43 undergraduate biology students from two different Midwestern universities, of whom 22 were assigned to GS and 21 to PP. Six students were biology majors at a small, regional, private university, and 37 students were biology majors at a large, public, research-intensive university. To make the two treatment groups as comparable as possible with respect to participant characteristics, participants were first grouped by self-reported class standing and race/ethnicity (Table 2) and then assigned randomly to a graphing environment within those categories. This approach was taken to increase our confidence in attributing student graphing practices to the graphing environment rather than to participant characteristics. However, we did not have a sufficient sample size to make claims about student graphing in relation to any participant characteristics.

Table 2 Participants’ graphing environments and demographic information

Two Graphing Environments: Pen-and-Paper Versus Digital Tool

Participants in both graphing environments read the same on-screen three-page introduction to the biology context described above. As part of this introduction, participants were shown a table of the data set from which they could select variables to be graphed (Fig. 2). Following the on-screen introduction, PP participants received the data set on a piece of paper, a blank sheet of paper, and a pen that recorded their pen strokes as well as audio (LiveScribe™ pen). Participants in PP were also informed that they could use their cell phone or a calculator to make any calculations they wanted to perform with the data. GS participants went on to a short video introducing the digital graphing tool, and then to a screen with GS. Each participant was asked to make a graph to address the research question. Participants could make as many exploratory graphs as they wished before submitting a “final graph.” During the graph construction process, participants were asked to “think aloud.” After making the final graph, participants in both treatments answered additional verbal interview questions (i.e., What type of graph did you make? Why did you decide to create the graph that you did? How does graphing in this interface compare to other ways you have built graphs?) whose main goal was to elicit participants’ justifications for some of their graphing decisions. One researcher observed the whole process and conducted the post-graphing semi-structured interview.

In deciding on graphing practices for which to analyze the data (Table 1), we chose only those for which both PP and GS participants were able to use similar feature sets. For example, participants in PP had a blank piece of paper and a pen, which means they could plot data on any scale they wanted but needed to draw all elements themselves (e.g., axes and their units, and axis breaks if they wanted them). Meanwhile, participants in GS were provided with a screen that already had, among other features, the axes. While GS participants did need to assign minimum and maximum values for each axis plotting a quantitative variable, GS evenly spaced the data on the axis, and axis breaks were not available. Thus, because some students in PP used scaling features not available in GS, we did not specifically compare PP and GS participants on those features. We discuss graphing practices we saw used in one environment but not the other in a qualitative sense only.

Data Collected

This study aimed to understand biology students’ graphing practices in two different environments through a mixed-methods approach (Patton, 2015). The data for this study comprise (1) all participants’ graph information (variables plotted, graph type, whether the data plotted were raw or summarized, and the total number of graphs created) and (2) audio recordings of participants’ answers to set questions during a semi-structured interview (see “Two Graphing Environments: Pen-and-Paper Versus Digital Tool,” above). The combination of these different data sources resulted in a rich supply of information that provided a more complete picture than would have been possible with a single method of data collection. We analyzed these data in two ways, as described below.

Evaluating Participants’ Graphs

The first four graph construction practice concepts in Table 1 were simple to score. Form of the Data, Graph Type, and Acknowledging Variability are read directly off each graph. To score Selection of Variables, we noted that a graph that did not include the variable MPA Status would do a poor job of addressing the research question, so we scored whether that variable was included on one of the axes or as a distinct feature of the points plotted (e.g., color). In PP, participants were able to select a subset of the data to be plotted (e.g., only MPA yes), an action that was not available to GS participants. Therefore, to be consistent across the two environments, we did not differentiate whether participants plotted the complete data set (MPA yes and no) or a subset (only MPA yes).

For the more synthetic concept of Graph Communication, we scored each participant’s graph as being of high, medium, or low quality. As shown in Fig. 3, we based the graph quality determination on the four more atomic criteria we scored, but considered whether they were combined in ways that did a good job of communicating the data. For this data set, we judged that these criteria reflect the ease with which a viewer can discern patterns in the data.

Fig. 3
figure 3

The rules used to score graph quality. The segments show how a graph was evaluated as high-, medium-, or low-level
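As a rough illustration of how rule-based quality scoring of this kind could be encoded, the R sketch below combines the four atomic criteria into a high/medium/low rating. The specific rules and thresholds here are invented for exposition; the study applied the decision rules shown in Fig. 3.

```r
## Illustrative sketch of rule-based quality scoring. The rules below are
## invented for exposition; the study used the decision rules in Fig. 3.
score_graph_quality <- function(has_mpa, graph_type, data_form, has_error_bars) {
  if (!has_mpa) return("low")                     # relevant variable missing
  type_ok <- graph_type %in% c("bar", "scatter")  # suits this data set and prediction
  variability_ok <- data_form == "raw" ||
    (data_form == "summarized" && has_error_bars) # plotted means need error bars
  if (type_ok && variability_ok) {
    "high"
  } else if (type_ok || variability_ok) {
    "medium"
  } else {
    "low"
  }
}

score_graph_quality(TRUE,  "scatter", "raw",        FALSE)  # "high"
score_graph_quality(TRUE,  "bar",     "summarized", FALSE)  # "medium"
score_graph_quality(FALSE, "line",    "raw",        FALSE)  # "low"
```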

Figure 4 shows examples of how we scored the quality of several graphs made in both PP and GS graphing environments according to the rules in Fig. 3.

Fig. 4
figure 4

Examples of graphs made in PP and GS grouped by the graph quality scores they received

To provide quantitative insight in answering our research questions, where possible, we compared the proportion of participant scores between the two graphing environments using Fisher’s exact tests in R 3.5.0 (R Core Team, 2018). We chose this test because of the structure of our data and analyses (categorical levels), our research questions, and its appropriateness for the small sample sizes in the categories being compared. Finally, to account for the multiple comparisons made while limiting the risk of Type II errors, we applied a Holm-Bonferroni correction to all tests; this sequential procedure controls the family-wise Type I error rate while being less prone to Type II errors than a standard Bonferroni correction (Holm, 1979). We draw inferences based on this correction throughout.
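A minimal sketch of this analysis pipeline in R is shown below. The 2 × 2 counts are placeholders rather than our data, and the p-values passed to p.adjust() are the uncorrected values reported in the Results, included only to illustrate the Holm step.

```r
## Minimal sketch of the analysis approach. The 2 x 2 counts are placeholder
## values, not the study's data; the p-values given to p.adjust() are the
## uncorrected values reported in the Results.
included_mpa <- matrix(c(21, 1,   # GS: plotted MPA, did not (placeholder counts)
                         14, 7),  # PP: plotted MPA, did not (placeholder counts)
                       nrow = 2, byrow = TRUE,
                       dimnames = list(c("GS", "PP"), c("MPA plotted", "not plotted")))
fisher.test(included_mpa)$p.value  # exact test suited to small cell counts

# Holm's sequential Bonferroni correction across the family of comparisons
p.adjust(c(variables = 0.007, form = 0.412, type = 0.559, quality = 0.0723),
         method = "holm")
```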

Analyzing Participants’ Justifications

In addition to considering how the visible products participants produced might be affected by graphing environment, we also wished to address whether the graphing environment affected participants’ graph construction thought process. As a window into their thinking, we analyzed participants’ justifications for their graph type as expressed in the interviews after they completed their final graph, in response to the set questions from the interviewer (see “Two Graphing Environments: Pen-and-Paper Versus Digital Tool”). We used the constant comparative method (Strauss and Corbin, 1994) to build categories of participants’ justifications, and then three researchers coded the transcripts of participants’ justifications to consensus. Our coding process led to the following four intertwined categories of justification, which were not mutually exclusive: data visualization, data exploration, data characteristics, and ease of use (Table 3).

Table 3 Categories of participant graph type and variable selection justifications

Once participants’ justifications were analyzed using the described codes, we explored patterns related to participants’ graph levels and justifications.

Results

Participants’ Graphs Show a Mix of Similarities and Differences Between Environments

We scored participants’ final graphs (those they reported as final to the interviewer) along the five criteria we defined above (Table 1) for evaluating participants’ graphs. Participants using the two graphing formats produced similar graphs on three criteria, and different graphs on two other criteria (Fig. 5; Table 4). We discuss each criterion in turn below.

Fig. 5
figure 5

Summary of graphs between the GS and PP environments. Displayed are the numbers of final graphs that included the relevant variable, MPA (a), displayed raw or summarized data (b), were of different graph types (c), and were in the low, medium, or high graph communication quality level (d). All stippled bars are GS and solid bars are PP

Table 4 Differences in graphing practices between formats

Selection of variables. As described in the Methods, we looked at whether one relevant variable, MPA status, was plotted on one of the axes or was used to color data points. A much higher percentage of participants using GS included MPA on their graph than did those using PP (Fig. 5a), and this difference was significant (p = 0.007).

Form of the data. Regardless of graphing environment, the majority of the participants (84%) plotted raw data instead of summarized data (Fig. 5b), and we saw no statistical difference between participants using GS vs PP (p = 0.412).

Acknowledging variability. Few students in either graphing environment plotted summarized data in their final graphs (7 of 43), but of the five participants who plotted summarized data in GS, four added error bars, while neither of the two participants who plotted summarized data in PP added error bars.

Graph type. Participants using both graphing environments tended to choose either scatter or bar graphs over line graphs, as was appropriate for the data and prediction being tested (Fig. 5c). We saw no significant difference in graph type choice between environments (p = 0.559).

Graph communication. Using the graph quality rules in Fig. 3, most graphs in both graphing environments were categorized as medium-level (Fig. 5d). Although there was a visually apparent shift towards higher-level graphs among GS participants compared with PP participants, this difference was not significant (p = 0.0723).

Potential Impact of GS Affordances on Data Exploration and Graph Communication

The most dramatic affordance of GS we observed was enabling students to explore the data by making many more graphs, on average, than in PP (Fig. 6). While no participant in PP made more than two graphs, and the vast majority (17/21) made only a single graph, almost half (10/22) of participants in GS made more than two graphs.

Fig. 6
figure 6

Graph communication quality for succession of graphs made by each participant. GS participants on the left, and PP participants on the right. For each numbered participant, the rectangles each represent one graph made, in the order they were made from left to right. Shading represents the graph quality according to the rules shown in Fig. 3

Student comments supported the idea that digital tools facilitate exploration. For instance, one student in the PP treatment commented that they found it harder than making graphs by computer because with a program like Excel “you don’t have to worry about putting dots in the right place and you can do like calculations easy and make your lines” (ID 7711, PP, Final graph = high). Several students in the GS treatment compared the tool with using Excel, with mixed thoughts on which was preferred but almost universally using the word “easy” in their comments. For instance, “[Excel] was a lot easier to play around with it and see which one you like” (ID 5507, GS, Final graph = high) came from a student who preferred Excel, while another said, “[GS] is easier [than Excel] because there are a lot of less noise there is just the buttons that you really need and that makes it very easy” (ID 9112, GS, Final graph = high). Regardless of which interface they preferred, students seemed to associate digital graphing tools with ease of use.

It is a trickier question to assess whether the ease of making multiple graphs affected participants’ graphing practices. As reported above, although there were more graphs of the highest quality in GS, the differences in graph communication quality between formats were not significant overall. However, among GS participants who made more than two graphs, there appears to be a trend from lower quality initial graphs to higher quality final graphs (Fig. 6). In particular, none of those participants started with a high-quality graph, but 3/10 had a high-quality final graph.

Because the use of a digital tool enables quick data exploration and graphing, we explored the relationship between the time to create the first graph and the graphing environment. We did not see a significant difference between the two (p > 0.05). We further explored any relationship between graph quality level and time to complete the first graph and again saw no trend notable enough to comment on.

Graph Type Justifications Differed Between PP and GS Participants

We asked each participant to describe why they had selected a particular graph type (bar, line, scatter, etc.) for their final graph, and then classified these justifications into four types of answers (see Methods). Many students from both PP and GS environments mentioned good data visualization and the characteristics of the data as rationales for the type of graph they chose, and we did not see much difference in the frequency of those responses between groups (Fig. 7). For instance, students made statements such as “A bar graph allowed me to see the average instead of all the points” (ID 7382, GS). The most common answer for participants using the PP format, though, related to how easy it was to draw a particular graph. PP participants picked graphs that they could draw with less effort, something that was much less common among GS participants. For instance, PP participants made statements such as “I ended up making a line graph to save time” (ID 6030, PP).

Fig. 7
figure 7

Participants’ graph type justifications compared between graphing environments. Shown are the percentages of participants in each graphing environment who stated a given justification for the graph type chosen. In parentheses are the total numbers of participants stating the justification. A given participant may have stated more than one justification category. Stippled bars are GS and solid bars are PP

There were also more PP than GS students who mentioned being able to explore the data as a justification for the graph type they chose, but the difference between groups was smaller than for ease of use (Fig. 7). We did not see differences in graph type justifications in relation to graph quality that were large enough to comment on.

Constraints of GS Can Both Reveal and Obscure Participants’ Graphing Practices

As part of the design of the GS graphing tool to facilitate auto-scoring and to decrease the number of dimensions to be scored, we implemented a number of constraints in the GS environment relative to the PP environment (see Introduction), and these had effects on student graphing practices. We saw this in the type and nature of the graphs some participants made in the PP environment that were not possible in the GS environment (11/21 participant graphs), such as graph types other than bar/scatter/line, graphs with multiple y-axes, multi-panel graphs comprising subsets of the data (e.g., one graph for MPA and another for non-MPA), trendlines and axis breaks, and plotting all of the variables (Fig. 8). The most common graphing difference was the inclusion of multiple variables (e.g., urchin and lobster density) plotted on the same y-axis (4/21 participant graphs) or by adding a second y-axis (1/21 participant graphs) (Fig. 8; 6220 and 8587). Of particular interest, the ability of PP environment graphers to plot more than three variables reveals more about their variable relevance graphing practice than might be captured in the constrained GS environment (4/21 participant graphs) (Fig. 8; 7711, 6220, 8587).

Fig. 8
figure 8

Examples of graphs with features not possible in GS

Discussion

Undergraduate Students Performed Similarly on Several, But Not All, Criteria Between Graphing Environments

To address our first research question, asking about similarities and differences in student graphing practices in relation to the graphing environment, we compared the graphs students made in PP versus those in GS on five criteria. On three of the criteria (graph type, form of the data, and graph quality), there were no large differences in participant performance between environments. Plotting raw data rather than summarized data is also consistent with the findings of previous studies on student graph construction (Angra and Gardner, 2017). It could also be that participants considered the size of the data set (18 points per variable) small enough to plot every single data point. Future scenarios could include a larger data set to explore this graphing practice further.

Virtually all participants in GS plotted MPA status; while the majority of those in PP plotted MPA status as well, significantly more PP than GS participants chose not to plot MPA status. That most participants plotted MPA status indicates that there was high understanding among participants of the biological scenario and the question being posed, and it seems unlikely to us that more PP participants, uniquely, did not understand the scenario or the data, since both treatments read the background material in the identical digital environment. Although we speculate below on how the affordances and constraints of GS may have led to other differences, no obvious hypothesis comes to mind as to why the GS as opposed to the PP environment would have guided students towards selecting MPA status.

The other criterion where we noted a difference, albeit with a very small sample size, is acknowledging variability. Here, as we discuss more below, we speculate that features of GS may have led to or made it easier for students to plot error bars.

To generalize, it appears that many aspects of students’ graphing practices can be captured similarly in a digital assessment as with pen-and-paper.

The Constraints and Affordances of Digital Assessments May Activate Latent Knowledge or Lead to Just-in-Time Learning

To address our second research question, asking how constraints and affordances of graphing environments may affect student graphing practices, we used both records of students’ graphing exploration and qualitative data from think-aloud interviews. Students in PP commented on how ease of drawing was a factor in the choices they made in constructing their graphs (see quote from ID 2143), and only in GS did students make more than two graphs, in many cases improving their graphs in the process. Thus, the affordance of GS in allowing quick construction of graphs may allow students to focus more on higher-order graph construction practices such as the shape of the distribution and the context of the inquiry (Friel et al., 2006; Aberg-Bengtsson, 2006), providing a better picture of their abilities and understanding.

As previous studies have reported (de Freitas and Sinclair, 2012, 2014; Sinclair and de Freitas, 2013; Sinclair and Yurita, 2008), the constraints of a digital tool like GS may also affect the skill students demonstrate. Only certain options are available in GS, and the visibility of those options may remind students of those possibilities in constructing their graphs: both possibilities that lead to better graphs and those that lead to worse graphs. There are indications that, on average, the constraints in GS prevented poor graphing or activated latent knowledge of graph construction that helped participants produce what we evaluated as high-level graphs. Several PP participants (6/21) might have benefited from the GS constraints, which would not have allowed them to plot extra, irrelevant variables (4/21) or only a subset of the data (i.e., only MPA yes data; 2/21).

The affordance of easier graphing and the constrained options may also lead to learning through the process of taking the assessment. Of particular interest is that participants in GS who made more than two graphs tended to start out with worse graphs than those who made just one or two graphs (Fig. 6), and further, that those multiple-graph participants tended to improve between their first and final graph. This may be an indication of activating latent skill with graphing practice concepts and/or a demonstration of learning within the assessment. We speculate that students who had more experience graphing to begin with knew exactly what an appropriate graph would look like and went straight to constructing that graph. Those with less experience may have needed to play around more, but in the process of playing found their way (in many cases) to a higher quality graph. This interpretation is supported by previous studies on the role of the graphing environment (Sinclair and de Freitas, 2013) and graphing experience (Roth and McGinn, 1997), by work on other assessments showing such learning through intermediate constraint question formats (Meir et al., 2019), and by a constrained experimental design task (Meir, in preparation). While prior studies explicitly comparing science skill assessment by environment are rare, at least one prior study in a similar context also found that students performed better on a digital assessment than on pen-and-paper, and similarly speculated that the constraints and affordances aided students in drawing on latent knowledge (Kumar et al., 1994).

To generalize, the constraints and affordances in a digital assessment may aid students to show higher skill than in a pen-and-paper environment, and this may, on balance, provide a better representation of the students’ true level of skill.

Limitations of the Study

The results we discuss above are general in the sense that we demonstrate that such similarities (e.g., similar assessments of certain practices) and differences (e.g., the affordance of easier graph creation) are possible. However, we only present data for a single, idiosyncratic digital tool and a single data set and hypothesis against which students were asked to graph. Other tools and presenting scenarios may lead to different results. In particular, we could imagine different results if students were presented with larger data sets, or with different levels of constraint in the digital tool.

The comparison between the GS and pen-and-paper graphing environments was not meant to be a one-to-one comparison with graphing environment as the only variable. Rather, some of the features of the GS were based on our previous work (Angra and Gardner, 2016, 2017) and were meant to represent evidence-based improvements for instruction and assessment, as mentioned in the section above. Further, we narrowed the range of possible graphing decisions and actions in the GS graphing environment in order to interpret our data and discern patterns. Thus, it is likely that the differences between the two environments had additional indirect effects on the student graphing practices we observed. For example, while the ease of switching between graph types in GS may have promoted an exploration of the data for some participants, this may have led students to focus more on aesthetics rather than selecting a graph that was appropriate for the data and purpose (Tairab and Khalaf Al-Naqbi, 2004). The ease of calculating averages in the GS condition compared with the pen-and-paper condition may have affected the way in which students thought about the data and variation (Konold et al., 2015; Lehrer and Schauble, 2007). As a final example, the ability of students in the pen-and-paper condition to plot all the data and multiple y-axes could have affected their focus on carefully selecting the most relevant variables to plot and affected their ability to evaluate the prediction (Tairab and Khalaf Al-Naqbi, 2004; Patterson and Leonard, 2005). These examples highlight the importance of comparative research when introducing new technological approaches to student practices.

Finally, some of our conclusions are based on small numbers of students, and thus the statistical results we present could change in a larger study. Power analysis indicates that we could detect moderate to large treatment effects, so smaller differences in student behaviors between environments may not have been captured. But given the alignment of the quantitative results here with the qualitative data we collected and with previous work, both our own and others’ (Angra and Gardner, 2017; Aberg-Bengtsson, 2006), we think it likely that additional effects that showed up in larger studies would be relatively small in magnitude.

Digital Performance-Based Assessments Have Potential to Algorithmically Assess Higher-Order Skills and Improve Evidence-Based Teaching

Our data are promising both for research on student understanding of higher-order skills and for helping instructors improve their classes using just-in-time data. On three criteria, student work was similar in the two environments. We also have hints that student work was a more authentic representation of student graphing practices in GS than in PP, as evidenced by the number of graphs made, participants’ justifications for their final graphs, and the quality of graphs for those participants who made only one or two. Students who may have had the experience with graphing practices to make higher quality graphs did not do so in PP because of the effort involved. In GS, where effort was lower, such students were able to demonstrate their graphing practices more fully. Thus, GS may more authentically represent student graphing practice in some respects. The light constraints enable auto-scoring of the results (Meir, in preparation), which makes such assessments feasible to use on a larger scale and on the faster timelines needed for instructors to modify their teaching on the fly. We expect that these advantages of GS are likely to be broader than our particular tool, and these results should encourage the development of other digital performance-based assessments of higher-order skills using similar design criteria.

Our results also support using similar features in teaching tools, and in fact an assessment tool such as GS, with small modifications, might additionally serve as a teaching tool. The constraints serve as a form of scaffolding for student thinking about data and graphing and may address previous concerns related to students too rapidly producing graphs without reflection and thoughtful deliberation (Patterson and Leonard, 2005; Tairab and Khalaf Al-Naqbi, 2004).