Background

Academic integrity research has been published since at least the early 1900s. Academic integrity publications span disciplines and research is published under a variety of themes. Barnes (1904) published the paper “Student honor: A study in cheating”, providing an early example of using surveys to conduct qualitative and quantitative academic integrity research. Although Barnes’ research may not have been presented with the rigour of some more recent academic integrity investigations, it still identified themes that are commonly discussed in the field today.

In Barnes (1904), students were asked how they would respond to an academic integrity scenario that had occurred in real life and for which a variety of academic penalties were available. In the scenario, examination questions were said to have been stolen, giving the students who had access to them an unfair advantage over their peers. The responses received included disparate opinions from students regarding whether it was their business to get involved further. Thirty percent of male students and 35% of female students felt that reporting was necessary so that they would not be unfairly judged against other students who they now expected would benefit from better results. Barnes also noted differences in responses between genders, a theme which is still regularly investigated as part of academic integrity research to this day.

Barnes’ choice of paper title does seem to reveal the author’s own position on academic integrity issues. This is perhaps most clearly summed up in the following quote from the paper:

“The reasons are mainly selfish; the university's interests are far less important than self-protection; while general social responsibility is comparatively little felt.”

Despite being only six words long, the paper title “Student honor: A study in cheating” brings together two words at opposite ends of the academic integrity spectrum: honor and cheating. The word honor carries with it an expectation that students will act with positive integrity. The word cheating has negative connotations, suggesting students gaining an unfair advantage. A focus on transgressive behaviour is not necessarily wrong, but it can also lead to missed opportunities for people working in the academic integrity field. The tension over whether academic integrity should be framed in a positive or negative manner still exists in paper titles today and is the focal point of the research investigation presented in this paper.

This paper uses Natural Language Processing (NLP) techniques to provide a data-driven investigation into how academic integrity paper titles have been constructed between 1904 and 2019. The research examines the titles of 8,507 papers published in the wider academic integrity field to establish how far such titles present the field to readers using a positive or negative approach. The results are intended to help the academic integrity research field determine whether it wishes to present itself in a more positive manner.

Academic Integrity

The literature on academic integrity often considers this field through both positive and negative viewpoints, with integrity itself considered as a positive term. A look at the different opinions and presentations of this research is useful to help define how the field is changing, as well as to allow positive integrity and negative integrity ideas to be demonstrated through representative examples.

The popularisation of the term academic integrity is commonly attributed to McCabe. Despite this, in the single most highly cited paper in the field “Academic dishonesty: honor codes and other contextual influences”, McCabe and Trevino (1993) do not discuss academic integrity, but instead academic dishonesty. In the paper, McCabe and Trevino collected data using a survey methodology and discussed how this could be used to predict academic dishonesty. Despite its high citation level, the focus of both the paper title and content brings connotations of a negative presentation of integrity.

Similar observations to those made about McCabe and Trevino (1993) also appear in a literature review by Macfarlane et al. (2014). They examined 115 articles in the field across both Western and Chinese literature. Their review concluded that academic integrity is commonly defined by reference to misconduct, fraud and corruption. This paper will consider research with a focus on areas such as these as being representative of negative integrity.

An alternative group of approaches is possible. This paper will consider such approaches as representative of positive integrity, often signalled by the unqualified term academic integrity. Macfarlane et al. (2014) define academic integrity as “the values, behaviour and conduct of academics in all aspects of their practice”. An alternative definition, given by East and Donnelly (2012) based on the values of the institution they work for, is “academic integrity means being honest in academic work and taking responsibility”. That interpretation is close to the position of the sector organisation the International Centre for Academic Integrity (ICAI). ICAI take a positive integrity view and define this concept in terms of core values, asking members to commit to “honesty, trust, fairness, respect and responsibility” (Fishman, 2014).

Fishman (2016) discusses the variety of frameworks which academic integrity is presented under in the United States. These include moral and ethical frameworks, pedagogical frameworks, legalistic frameworks, comparing academic integrity with criminal behaviour and even considering this as a form of disease. Although these frameworks provide some opportunity for a positive discussion, the most immediate interpretation is that academic integrity should be viewed through a negative lens.

There have been opportunities for the negative viewpoint to change. McCabe and Pavela (2004) discuss principles they believe will help build a culture of academic integrity, such as making this an institutional value with consistent standards, clarifying expectations with students, enabling students to take responsibility and ensuring fair assessment. How academic integrity principles are taught to students and how far teaching can take a positive approach continues to be an important part of the modern discussion (Ransome & Newton, 2018; Sefcik et al., 2020).

One underlying principle regarding making academic integrity work at an institutional level is that it should apply to the whole academic community, not just to students and not just to academics. The student voice is being increasingly considered as an essential and important part of this discussion (Pitt et al., 2020).

The fields of research studied within academic integrity have widened in recent years, with new areas developing as a result of observing threats to academic integrity. Some identified challenges include cybersecurity threats (Dawson, 2020), contract cheating (Clarke & Lancaster, 2006), study helper websites (Harrison et al., 2020) and paraphrasing tools (Prentice & Kinden, 2018). The positioning of research discussing threats to integrity and opportunities for student misconduct suggest a continuing view of negative integrity. The fast pace of technological change and the need to raise awareness of this further suggest that a certain level of negative integrity research will always be required within the field.

The widening of the academic integrity research field and the growth of technology have brought with them the opportunity for innovation in how academic integrity research is conducted. Methodologies have moved beyond surveys. Social media analysis can be used to investigate why students cheat (Amigud & Lancaster, 2019). Region and sector specific literature reviews are possible (Eaton & Edino, 2018). Internal academic conduct records can be analysed (Atkinson et al., 2019). Others have had success analysing existing policies (Eaton et al., 2020). There is plenty of alternative data available that can be examined.

This paper takes such an alternative and data-driven research approach. It considers existing data relating to published academic integrity research and uses NLP techniques to programmatically examine this data.

For the purpose of this paper, the view that academic integrity applies to everyone is supported, but this is balanced by the observation that papers are most relevant if they fit within an educational setting. As such, the interpretation considers academic integrity as it applies to teaching, learning, pedagogy and education, where students, academics and professional university staff are at the forefront of the conversation. The related field of research integrity is sometimes included with academic integrity, but to avoid muddying the water it is only included in the investigations reported here when it also relates to education.

Although this paper makes no attempt to provide a fresh definition of academic integrity, the approach taken naturally identifies papers with examples of both positive integrity and negative integrity. One by-product of looking at both positive and negative views is that the range of papers, topics and issues identified will, it is hoped, help to inform future definitions of academic integrity so that they can be both current and complete.

Investigative Methodology

Formation of the Primary Data Set

The research presented in this paper relies on a data-driven approach. Data was collected in May 2020 to form a primary data set. From this, four further secondary data sets were derived.

The procedure through which the data sets were gathered and processed employed standard techniques from the domains of NLP and machine learning. As is customary in this field, experimentation was undertaken on the initial data to determine how best to present it for NLP. Some of the final decisions presented here may appear arbitrary, but they were made to fine-tune the results for readability and accuracy. The pipeline is presented to provide enough information for researchers looking to undertake related studies, whilst recognising that this paper is aimed at the academic integrity field, an audience who may be unfamiliar with NLP.

Google Scholar was used as the primary data source to identify academic integrity research publications. Data was collected from Google Scholar through an iterative process, with the aim of ensuring data set completeness and consistency. Both manual and automated checks and corrections were made on the resulting data. Excel and Python were used extensively to support data collection and processing, with several scripts developed for internal use. The NLP aspects of processing relied heavily on the NLTK platform. Sentiment analysis, a process where the subjective information in a written expression is evaluated to identify the tone of that expression, was completed using the VADER toolkit. Both NLTK and VADER are open source.

Figure 1 provides a high-level overview of the data collection, cleansing and processing pipeline.

Fig. 1 Data set formation pipeline

As Fig. 1 indicates, an initial set of search terms for Google Scholar was identified. These included terms such as academic integrity, academic dishonesty, student plagiarism and contract cheating. In each case, a search for these terms in the title of documents was conducted. This search type meant that a term like student plagiarism would match the word student and the word plagiarism appearing anywhere within a title and in either order. The results were manually inspected to identify other possible search terms. Subsequently, a list of bigrams (two consecutive words) in the titles was generated to identify more possible search terms, with frequently occurring bigrams related to the wider academic integrity area also used. This process identified, for example, the term academic honesty as an alternative to academic dishonesty. The process also suggested the term research integrity, but this was deliberately excluded to avoid adding large quantities of papers to the initial data set that were unrelated to teaching, education or students. Nevertheless, some research integrity papers do appear in the final evaluation where these were identified through other terms and did prove to be relevant to academic integrity. Table 1 shows the final set of search terms that were used.

Table 1 Google Scholar search terms used to generate the initial data set
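As an illustration of the bigram generation step described above, the following sketch shows how candidate search terms might be surfaced from a collection of titles using NLTK. The scripts used in this study were developed for internal use, so the names and sample titles here are illustrative only.

```python
# A minimal sketch of bigram-based search term discovery, assuming the
# candidate titles are available as a plain list of strings.
from collections import Counter

from nltk import bigrams, word_tokenize

titles = [
    "Academic dishonesty: honor codes and other contextual influences",
    "Student honor: A study in cheating",
]  # illustrative only; the study used titles returned by Google Scholar

counts = Counter()
for title in titles:
    tokens = [t.lower() for t in word_tokenize(title) if t.isalpha()]
    counts.update(bigrams(tokens))

# Frequently occurring bigrams become candidate search terms for manual review
for (first, second), frequency in counts.most_common(10):
    print(f"{first} {second}: {frequency}")
```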

The initial data set required extensive cleansing. The process which Google Scholar uses to crawl research papers and generate its own records is error-prone, so the initial data set contained many duplicate entries, for example where one version had a slightly incorrect title, listed authors in a different order or used author name variants. There were many cases where unsuccessful parsing of research documents had generated incorrect information in the Google Scholar database. In addition, information standing out as potentially suspect was cross-referenced against other sources. One such example was an article about student cheating and the Internet, allegedly published in 1970, whereas a check on the journal’s own web pages revealed the correct date.
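One plausible automated check for duplicates, sketched below, is to compare titles after normalising case, punctuation and whitespace; this is an assumption about how such a check could work, not a record of the exact cleansing scripts used.

```python
# A hypothetical duplicate-flagging pass over Google Scholar records,
# where each record is a dict containing at least a 'title' field.
import re

def normalise(title: str) -> str:
    """Lower-case a title and collapse punctuation and whitespace variants."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

records = [
    {"title": "Student honor: A study in cheating", "year": 1904},
    {"title": "Student Honor - A Study in Cheating", "year": 1904},
]  # illustrative records only

seen = {}
for record in records:
    key = normalise(record["title"])
    if key in seen:
        print("Possible duplicate for manual review:", record["title"])
    else:
        seen[key] = record
```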

A further pruning process was necessary to limit the initial data set to papers that were research related and on subjects in the wider academic integrity field. Table 2 summarises some of the main criteria applied to identify whether papers should be included in or excluded from the primary data set. No direct attempt was made to judge the quality of the papers or to exclude those published in predatory journals, although the development of secondary data sets of papers that the academic integrity community considers most important does indirectly address this possible limitation.

Table 2 Inclusion and exclusion criteria for the primary data set

The primary data set contained information about 8,507 research sources published between 1904 and 2019. A cumulative frequency graph showing when the papers were published is shown in Fig. 2.

Fig. 2 Cumulative frequency chart of academic integrity paper publications

Figure 2 indicates that the rate of increase of publications in the academic integrity field has been exponential. There were approximately the same number of papers published between 1904 and 2011 as there were between 2012 and 2019. Although this may seem like a steep rate of increase, worldwide science and engineering publications were found to have grown at a rate of 4% per year between 2008 and 2018 (White, 2019). The corresponding figure for academic integrity publications is just below 3% per year.

Formation of the Secondary Data Sets

A further four smaller secondary data sets, all subsets of the primary data set, were developed to allow for a more detailed investigation. These data sets are summarised in Table 3.

Table 3 Summary of primary and secondary data sets

Data sets B, C and D consider the most cited papers of all time. These are intended to represent the papers that have overall influence on the academic integrity field. Although these data sets would seem to favour older publications, the 1000 most cited papers data set (D) does contain papers dated as recently as 2019. In general, the primary data set (A) is where most recently published research can be found. This is illustrated in Fig. 3, which shows the relative cumulative frequencies of publications in the five data sets, truncated to start from 1979.

Fig. 3 Relative cumulative frequencies of paper publications in data sets (1979 to 2019)

Computation of Individual Primary and Secondary Data Set Records

Individual records were generated for each paper included in the data sets through a combination of continued data cleansing and the application of NLP techniques.

The paper title information obtained from Google Scholar was tokenised to represent paper titles as a series of words of interest. This included the removal of common English language stop words (“and”, “the”, “of” etc.) based on a standard list for the library used. In addition, the word “among” was removed. Two further minor changes were made to merge wording variants that were not picked up by the standard tokenisation process. This saw the token “student” replaced by “students” and the token “toward” replaced by “towards”. This decision was made to allow these common terms to be clustered together and improve the readability of the final results, rather than have two near-identical terms occupy two sets of results and confuse matters.
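A minimal sketch of this tokenisation step, assuming NLTK’s standard English stop word list, is shown below; the removal of “among” and the student/toward substitutions are the adjustments described above.

```python
# Title tokenisation: lower-case, keep alphabetic tokens, drop stop words,
# then merge the two near-identical term pairs noted in the text.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english")) | {"among"}
MERGES = {"student": "students", "toward": "towards"}

def tokenise_title(title: str) -> list:
    tokens = [t.lower() for t in word_tokenize(title) if t.isalpha()]
    return [MERGES.get(t, t) for t in tokens if t not in STOP_WORDS]

print(tokenise_title("Student honor: A study in cheating"))
# -> ['students', 'honor', 'study', 'cheating']
```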

Table 4 shows the information collected and computed for each record in the data sets, along with an indicative example. As well as information collected directly from Google Scholar and subsequently cleansed, this includes information computed using the standard NLP techniques of unigrams, bigrams and trigrams. A sentiment analysis score for each title is also calculated to determine if the title represents positive, neutral or negative integrity. In the case of the example of Eshet et al. (2014) shown in Table 4, the overall sentiment is considered to be negative, with the automated sentiment analysis likely to have made this judgement through the use of the terms “traits” and “academic dishonesty” in the paper title.

Table 4 Data attributes used in this study for each record in the data sets
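The sentiment scoring can be reproduced in outline with NLTK’s bundled VADER implementation. The ±0.05 thresholds on the compound score below are VADER’s conventional cut-offs and are an assumption here, as the exact thresholds used in the study are not stated.

```python
# Classifying a title with VADER; requires nltk.download('vader_lexicon').
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

def title_sentiment(title: str) -> str:
    compound = analyser.polarity_scores(title)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(title_sentiment("Academic dishonesty: honor codes and other contextual influences"))
```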

Research Methodology Limitations

Some natural limitations of the approach used within this investigation are worthy of mention. The data sets represent a snapshot of content on a live source of data, one that continually receives updates and corrections. The volume of citations observed in May 2020 will be different to that which would have been seen at the end of 2019, the cut-off point for including papers in the data sets. Even then, official publication dates can differ from the date papers were first available to be read and cited. This stems from the advent of papers being published online first before they are assigned to a journal issue.

The analysis presented focuses on paper titles, rather than paper abstracts or their contents. This assumes that titles accurately reflect the contents of the papers. There will be exceptions to this, for instance when a title is written for shock value or to encourage readership, in much the same way that newspaper headlines can be written to draw attention. The tendency for authors to think about search engine optimisation when organising papers is also a relatively recent change that may have influenced the choice of titles within the field.

Only a single source, Google Scholar, is used for data collection. The quality of the data sets is limited to the quality of the underlying source. The resulting data sets did require manual clean up and it is likely that a small number of errors remain. The time afforded for data cleansing and consistency checking was used strategically, focusing most on the secondary data sets since these are likely to have the greatest influence on future practice. Small errors in the primary data set (A) of 8,507 items should have no discernible effect on the accuracy of the overall results. These results are still of importance to the wider academic integrity research field.

Results and Discussion

Most Frequently Occurring Unigrams, Bigrams and Trigrams in Paper Titles

Table 5 summarises the 10 unigrams, bigrams and trigrams seen most frequently in the primary data set (A). Each of these measures provides insight into academic integrity research at different levels of granularity.

Table 5 Most frequent unigrams, bigrams and trigrams in primary data set

The unigram data indicates that 7,161 unique unigrams were observed in the primary data set, with a total of 60,402 occurrences. This gives a mean of 8.43 occurrences per unigram and a standard deviation of 477.63. The top 10 ranked unigrams covered 17,203 of those occurrences between them (28.48%). The unigram list does not appear to be particularly insightful.

The bigram data provides a more useful level of granularity, with 30,169 unique bigrams and a total of 51,898 occurrences. That is a mean of 1.72 occurrences per bigram, with a standard deviation of 11.22. The top 10 ranked bigrams cover 4,620 occurrences (8.90%). The first and second ranked bigrams, “academic integrity” (1,234 occurrences, or 2.37%) and “academic dishonesty” (1,068 occurrences, or 2.06%), indicate the close relationship between the use of these two terms.

The trigram data indicates variety across paper titles, with 36,417 unique trigrams observed across 43,420 occurrences, giving a mean of 1.19 occurrences per trigram and a standard deviation of 1.42. Between them, the top 10 ranked trigrams cover only 524 occurrences (1.21%). The terms make intuitive sense, and the quadgram “source code plagiarism detection”, formable from the second and fifth ranked trigrams, stands out as indicating interest in the academic integrity techniques often considered most specific to Computer Science.
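Frequency tables such as Table 5 can be assembled in a few lines once titles are tokenised. The sketch below assumes the tokenise_title helper from the earlier tokenisation example.

```python
# Count the most frequent n-grams across all tokenised titles;
# assumes tokenise_title from the tokenisation sketch above.
from collections import Counter

from nltk import ngrams

def top_ngrams(titles, n, k=10):
    """Return the k most common n-grams across the given titles."""
    counts = Counter()
    for title in titles:
        counts.update(ngrams(tokenise_title(title), n))
    return counts.most_common(k)

# top_ngrams(titles, 1) -> most frequent unigrams
# top_ngrams(titles, 2) -> most frequent bigrams, e.g. ('academic', 'integrity')
# top_ngrams(titles, 3) -> most frequent trigrams
```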

Exploration of Bigram Data

The bigram level provides the opportunity to further explore the primary and secondary data sets. Figure 4 shows the data obtained from the primary data set in more detail.

Fig. 4 Top 25 bigrams observed in primary data set

Categorising the bigrams provides an indication of the topics of most interest to academic integrity researchers. The positive integrity terms “academic integrity”, “academic honesty” and “integrity education” can be combined to cover 1,568 out of 51,898 occurrences (3.02%). Correspondingly, the negative integrity terms “academic dishonesty”, “academic misconduct” and “student cheating” can be combined to give 1,792 out of 51,898 occurrences (3.45%), suggesting a slight bias towards negativity in paper titles. Other terms suggest wider areas of interest, including the academic level of students (such as university, college and high school), academic integrity challenges (such as plagiarism and contract cheating), methods of addressing challenges (such as through case studies and plagiarism detection), issues of interest to particular research subgroups (academic writing, source code plagiarism and plagiarism detection), as well as the type of data hoped to be gathered in many research projects (perceptions and attitudes).

The data sets were further interrogated to identify how many of the 25 most frequent bigrams occurred in the paper titles, as well as the number of the three positive bigrams (“academic integrity”, “academic honesty” and “integrity education”) and three negative bigrams (“academic dishonesty”, “academic misconduct” and “student cheating”) seen in those titles. Particular attention is paid to the most prolific of those terms, “academic integrity” and “academic dishonesty”, which are also analysed separately. The results are seen in Tables 6 and 7.

Table 6 Average (Mean) Numbers of 25 Most Frequent Bigrams Found in Paper Titles
Table 7 Standard Deviation of Numbers of 25 Most Frequent Bigrams Found in Paper Titles

Table 6 suggests that when more of the most frequently occurring bigrams are included in paper titles, those papers are more likely to be cited. The results also suggest that interest in negative integrity is greater than interest in positive integrity. For example, in the 1000 most cited papers data set (D), the average paper title contains 0.113 out of the 3 positive bigrams, but contains 0.240 out of the 3 negative bigrams, an increase of 112.39%. A similar finding can be observed when comparing the use of the bigram “academic integrity” with “academic dishonesty”. The negative bigram is the most frequent in all three of the most cited data sets (B, C and D).

There is one clear exception to this finding and that comes from the most prolific authors data set (E). This group uses 0.330 out of 3 positive bigrams per paper title, accompanied by only 0.205 out of 3 negative bigrams, showing that the titles they use contain 60.98% more positive than negative bigrams. Similarly, the prolific authors use the bigram “academic integrity” in almost one third of their paper titles, 136.76% more often than they use “academic dishonesty”.
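The per-title counting behind Tables 6 and 7 reduces to membership tests over each title’s bigrams, as sketched below with the tokenise_title helper from earlier. Note that, after the student/students merge, the “student cheating” bigram appears in tokenised form as ('students', 'cheating').

```python
# Count how many target bigrams appear in a tokenised title;
# assumes tokenise_title from the tokenisation sketch above.
from nltk import bigrams

POSITIVE = {("academic", "integrity"), ("academic", "honesty"),
            ("integrity", "education")}
NEGATIVE = {("academic", "dishonesty"), ("academic", "misconduct"),
            ("students", "cheating")}

def bigram_hits(title: str, targets: set) -> int:
    """Number of target bigrams found in the title's token sequence."""
    return sum(1 for bg in bigrams(tokenise_title(title)) if bg in targets)

# Averaging bigram_hits over a data set gives figures like those in Table 6
title = "Contract cheating and academic integrity"
print(bigram_hits(title, POSITIVE), bigram_hits(title, NEGATIVE))
```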

Further analysis shows variety in the number of the 25 most frequent bigrams included in paper titles. This is summarised in Table 8.

Table 8 Percentage of Paper Titles Containing 25 Most Frequent Bigrams

From the primary data set (A), 43.52% of paper titles do not contain any of the 25 most frequent bigrams. For the most prolific authors data set (E), that percentage is only 23.48%, suggesting that the researchers writing regularly in this field are familiar with the wider literature, the research base and the terminology to use. In all five data sets, the modal number of the most frequent bigrams used in a paper title is 1. There are 9 cases out of 8,507 records (0.11%) where four bigrams are used, sometimes as part of overlapping bigrams.

Table 9 compares the use of the bigrams “academic integrity” and “academic dishonesty” in paper titles. This indicates a perhaps alarming result: papers in the field are more likely to be highly cited if they take a negative integrity stance. Once again, the most prolific authors buck this trend. From data set B, none of the 10 most cited academic integrity papers of all time contain “academic integrity” in their title, or indeed any of the positive keywords identified from the 25 most frequently used bigrams.

Table 9 Percentage of Paper Titles Containing Specific Bigrams

Sentiment Analysis

The paper titles in the data sets were analysed using sentiment analysis techniques to determine if they represented positive, neutral or negative sentiment. A summary of the percentage of paper titles falling within each of these sentiments is shown in Table 10.

Table 10 Sentiment Analysis of Data Sets

For data sets A to D, the modal sentiment is neutral, although the most interesting comparisons lie between positive and negative sentiment. The results from the primary data set (A) show that titles are slightly more likely to be viewed as having positive sentiment rather than negative sentiment, but this is not consistently the case across all the data sets. In particular, the most cited data sets (B, C and D) show more negative than positive sentiment.

Different results are seen from the most prolific authors data set (E), where 120 out of 264 paper titles (45.45%) are computed to have positive sentiment, making this the modal sentiment group. Since the negative sentiment group contains 60 paper titles, this represents a 100% increase.

Considering a null hypothesis that, if randomly and independently determined, one third of paper titles should each show positive, neutral and negative sentiment, Pearson’s chi-squared test shows statistical significance for data sets A, C and D at the 0.001% level and for data set E at the 5% level. Data set B does not show statistical significance, but strictly speaking its sample size is too small for Pearson’s test to be valid.
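For reference, the test itself is a single call once the observed counts are known. The counts below are those reported for the most prolific authors data set (E): 120 positive, 60 negative and, by subtraction from the 264 titles, 84 neutral; the uniform expected split encodes the null hypothesis stated above.

```python
# Pearson's chi-squared goodness-of-fit test against a uniform split.
from scipy.stats import chisquare

observed = [120, 84, 60]  # positive, neutral, negative title counts (data set E)
statistic, p_value = chisquare(observed)  # expected defaults to equal counts
print(f"chi-squared = {statistic:.2f}, p = {p_value:.5f}")
```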

A further element of investigation aims to address how the sentiment of paper titles has developed over time. The results are shown in Fig. 5.

Fig. 5 Cumulative Sentiment Analysis of Paper Titles

Figure 5 provides a cumulative plot of the percentage of paper titles that were computed to have positive, neutral and negative sentiment. That is, Fig. 5 shows the sentiment results from all the papers published up to a given point in time. This trend shows promise if a move towards positive integrity is considered desirable. Although historically the sentiment of paper titles has been strongly negative, neutral sentiment overtook negative sentiment for the first time in 2003. Positive sentiment then overtook negative sentiment in 2008. The current trend shows a continued decline in the use of negative sentiment.

The answer to one final question may interest researchers in this field. Does having positive sentiment in a paper title affect the number of citations that paper is likely to obtain? Across the primary data set (A) as a whole, papers received an average of 14.27 citations. Paper titles with positive sentiment received an average of 10.60 citations, those with neutral sentiment 15.39 citations and those with negative sentiment 17.46 citations. It would appear that developing paper titles with negative sentiment affords a good way to get work cited within the academic integrity research field.
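This comparison is a group-wise mean over the sentiment classes. A sketch using pandas, with illustrative values in place of the real records, is shown below.

```python
# Mean citation count per sentiment class, assuming one row per paper.
import pandas as pd

papers = pd.DataFrame({
    "sentiment": ["positive", "neutral", "negative", "negative"],
    "citations": [12, 15, 30, 5],
})  # illustrative rows only

print(papers.groupby("sentiment")["citations"].mean())
```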

Conclusion and Recommendations

This paper represents the first study of its kind in the academic integrity research field, using the largest known data set of academic integrity research publications as its base. The analysis shows that academic integrity research is a field with rapid growth, but citations have been built upon publications with a negative concept of integrity. Both bigram analysis and sentiment analysis show a similar view of negative integrity, but with pockets of positive integrity shining through.

Many opportunities exist to take this research forward. Similar techniques can be applied to other fields, or to specialised subjects within the academic integrity area. The bigram technique has shown the existence of many long-tail keywords that are suitable for literature reviews and more detailed analysis. The sentiment analysis approach used is not specific to academic integrity and could be further optimised through the development of training data sets. In addition, it would be interesting to apply these techniques to paper abstracts and full papers to see if the same results hold. Academic integrity researchers and practitioners may find it useful to develop more NLP and linguistic analysis skills. Many of the techniques applied to research are already akin to those which can be applied to forensic investigation of student work to detect plagiarism and contract cheating (Ison, 2020; Johnson & Davies, 2020).

Although not an intended focus of the investigation, the data serendipitously raised a question regarding the value of much academic integrity research. In the primary data set developed for this paper, 2,854 out of 8,507 papers (33.55%) have never received a single citation. In addition, threats to research integrity were observed when examining the data set. A 2019 paper was found published in three different journals by the same suspect publisher with only slight changes to the paper title and abstract. Paper citation cartels seem to be developing, with single papers having a large group of authors, each of whom then goes on to cite as many papers as possible by members of the group. The effect seems to be an artificial inflation of the citation metrics for all members. In a field like academic integrity, researchers also need to hold their own practices up to the highest standards.

There are issues that need to be addressed regarding what content should be placed in research repositories and how Google Scholar results are produced. Students can be referred to Google Scholar as a valid starting point for their own research, but not all search results are suitable for this purpose. One university repository contains an archive of blog posts by researchers, but these are now listed by Google Scholar as if they are academic papers. There are also many examples of contract cheating providers finding ways to have their content added to Google Scholar, complete with visible adverts. The promotional methods of the contract cheating industry have already been observed as being highly suspect (Lancaster, 2019) and this is providing yet another method through which students can be brought into their marketing funnel.

One of the biggest disputes in the academic integrity community surfaced continually throughout this paper. Is the best terminology to use in research “academic integrity” or “academic dishonesty”? Should researchers take the opportunity to introduce a more positive viewpoint of the field? There is much historical interest in the use of the term “academic dishonesty”, but this term may no longer be necessary. Despite this, research papers that take a negative view of integrity, using terms such as cheating, dishonesty and misconduct, do drive an emotional response in a manner that integrity does not seem to do. Such papers then benefit from more citations and drive future research. It is something of a vicious circle.

This paper has demonstrated that it is possible to present the academic integrity research field using positive terminology. Several of the most prolific authors in the field are doing just that. More publications appear to be taking a positive integrity view than ever before. Emerging academic integrity researchers can and should be encouraged to model their approach on such papers and researchers. But further effort needs to be made by the academic integrity community to promote such papers and to show that research into positive integrity is possible, worthwhile and of value.

Perhaps then a move to purely talk about positive integrity is a step too far. As the sentiment analysis presented in this paper has demonstrated, the most recent trend in paper title construction has been a move towards titles devoid of positive or negative intention. Researchers in related fields talk about ethical neutrality. At the start of this paper, a quote from Barnes (1904) talked about social responsibility. Too often, academic integrity researchers are the same people who are the practitioners working on academic integrity in the classroom, teaching students and often awarding penalties for academic integrity breaches. Due to the nature of the research field, true independence of research from practice and teaching may be impossible. Aiming instead for neutrality as to how research in the academic integrity field is presented may then provide the best future solution for all concerned.