Bachelor’s degree student dropouts: Who tend to stay and who tend to leave?

doi:10.1016/j.stueduc.2021.100999

Studies in Educational Evaluation

Volume 70, September 2021, 100999

https://doi.org/10.1016/j.stueduc.2021.100999

Highlights

• The percentage of lost credit vouchers is the most important variable for the classification.
• Starting from the second semester, pre-entry attributes become irrelevant for classification.
• The main pre-entry attribute related to leaving after the first semester is the time gap between high school and university.
• Decision trees provide accurate and interpretable models.

Abstract

Factors of students’ dropout can be studied either by surveys among students or by analyzing data the university collects. In the work reported in this paper, we analyzed data known about students at the time of admission as well as data about the students' study achievements collected on a semester basis. Using data about students who enrolled in the academic year 2013/14, we created several data mining models to predict who will finish their studies successfully and who will not. Our results show that the key factor is the percentage of lost credit vouchers in the most recent semester. The pre-entry attributes have only a very small impact. We also created association rules of different types to find characteristics of students who did not successfully complete the first semester of study. Here, the factor that mainly increases the probability of a failure is the time gap between secondary and tertiary education.

Introduction

High dropout rates among university students can be observed all over the world. According to Aulck, Nambi, Velagapudi, Blumenstock, and West (2019), "Each year, roughly 30 % of first-year students at US baccalaureate institutions do not return for their second year". According to OECD (2013), the European dropout rate is 30 %. Norton and Cherastidtham (2018) report that "Nearly a quarter of a million students will start a bachelor degree in Australia in 2018, but more than 50,000 of them will leave university without getting a degree". Analytical materials of the Ministry of Education, Youth and Sport of the Czech Republic show that about 25–35 % of students who enrolled for bachelor's degree studies at Czech universities between the years 2003 and 2015 ended their studies during the first year (MSMT, 2016). So, regardless of the country, the numbers are very similar: a significant percentage of students who enroll in study toward a bachelor's degree leave the university before obtaining the degree.

As pointed out by Larsen, Kornbeck, Kristensen, Larsen, and Sommersel (2013), student dropout has consequences at the student, university, and society levels. The consequences at the student level are psychological (feeling of individual failure and loss of self-confidence) and economic (waste of personal resources). The consequences at the university level are economic, organizational, and academic. As many universities are funded according to the number of students, losing students means losing money. Organizational consequences are related to scheduling teachers, classrooms, and laboratories for students who might not enroll for the forthcoming semester. Academic consequences are related to the fact that, in the case that students are leaving by their own free will, the university can lose potentially successful students and graduates. At the society level, a high dropout rate affects both the educational system and labor market in a given country.

Complex dropout and retention models have been proposed to explain this phenomenon. Tinto's model of institutional departure, grounded in psychology and sociology, understands university as a composite of two systems: academic and social (Tinto, 1993). A student needs to be integrated into both of these systems to successfully finish his/her study. Tinto identifies family background, skills and abilities, prior schooling (considered as pre-entry attributes), institutional experience with both academic and social systems, integration into both academic and social systems, and external commitments as key factors affecting the goals and commitments of a student during his/her study, thus affecting the dropout decision. Bean's student attrition model is based on an analogy of student dropout with employee turnover in an organization (Bean, 1983). Like that of employees, students' satisfaction is also affected by organizational determinants. Bean models the dropout decision using four groups of variables: background, organizational determinants, satisfaction, and institutional commitment (Aljohani, 2016).

It is important for universities to understand the factors that lead to student success and student dismissal, and researchers can help with this understanding. Larsen et al. (2013) distinguish between two research approaches. The first approach, called "German", relies heavily on the dropout and retention theories. Thoroughly organized questionnaire surveys cover questions related to all aspects of dropout as described in the theory. The second approach, called "British", is more data-driven; university register data are analyzed with the aim of creating classification or prediction models. While the first approach goes more in-depth, the response rate of the survey is usually low. Moreover, whether a student will eventually drop out is not known when he/she answers the questionnaire. In the second approach, by contrast, large samples that include the dropout information are available, but they cover only a limited range of dropout-related factors.

The "German" approach is represented by detailed studies that report various factors related to the theoretical findings that influence the likelihood of a student dropping out. Sagenmuller (2018) identifies financial problems, poor secondary school preparation, poor identification with the study field, conflict with work and family commitment, increasingly failing courses, lack of quality time with teachers, de-motivating school environment, and lack of student support as key factors. Norton and Cherastidtham (2018) discussed factors related to personal and family background, academic performance, the field of education, and engagement with the study. The EU report on dropout and completion in higher education in Europe presents a comparative study based on an extensive review of literature and policy documents on study success, a survey of national higher education experts, and eight in-depth country case studies including the Czech Republic (Vossensteyn et al., 2015). A survey of attitudes and motivation of students at Czech universities was carried out as a part of a broad, cross-country European comparison (the Eurostudent VI project) in 2016. The studied factors related to dropout were age, gender, subjective assessment of study achievements, the form of study, type of study program, and type of school (Fischer et al., 2016).

The "British" approach is represented by the area of educational data mining, a sub-field of data mining that aims "to detect patterns in large collections of educational data that would otherwise be hard or impossible to analyze" (Romero & Ventura, 2013). Data mining in general can be used for segmentation, concept description, classification, prediction, dependency analysis, or deviation detection in many areas (Chapman et al., 2000). Statistical methods (e.g. logistic and linear regression, or k-nearest neighbor classification), as well as artificial intelligence and machine learning methods (e.g. decision trees, decision rules, association rules, neural networks, Bayesian methods or SVM), are applied to find useful knowledge in the data. These methods can be grouped into two categories. Symbolic methods, e.g. decision trees or rules, represent the found models in a human-understandable way, so both accuracy and interpretability are important. Sub-symbolic methods, e.g. neural networks or SVM, are not readily intelligible by humans, so only the accuracy of the models matters. In educational data mining, the data collected by a university about its students are used to create models to predict students' success (here both symbolic and sub-symbolic methods can be used) or to look for important characteristics related to students dropping out (here only the symbolic methods make sense).
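To make the notion of a symbolic model concrete, the following is a minimal sketch of a one-level decision tree (a decision stump) in plain Python. It is not the authors' model: the records, the feature name, and the helper functions `fit_stump`, `predict`, and `describe` are all hypothetical, and the single feature is only loosely inspired by the "percentage of lost credit vouchers" mentioned in the highlights. The point is that the fitted model can be printed as a rule a person can read.

```python
from collections import Counter

# Hypothetical toy records, invented for illustration only:
# (share of credit vouchers lost in the most recent semester, outcome).
STUDENTS = [
    (0.00, "completed"),
    (0.05, "completed"),
    (0.10, "completed"),
    (0.20, "completed"),
    (0.35, "dropped out"),
    (0.40, "completed"),
    (0.55, "dropped out"),
    (0.70, "dropped out"),
    (0.90, "dropped out"),
]


def majority(labels):
    """Return the most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(records):
    """Pick the threshold on the single feature with the fewest training errors.

    Returns a tuple (errors, threshold, label_if_below_or_equal, label_if_above).
    """
    values = sorted({value for value, _ in records})
    best = None
    # Candidate thresholds are midpoints between neighbouring feature values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [label for value, label in records if value <= threshold]
        right = [label for value, label in records if value > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = sum(label != left_label for label in left)
        errors += sum(label != right_label for label in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best


def predict(stump, value):
    """Classify one feature value with a fitted stump."""
    _, threshold, left_label, right_label = stump
    return left_label if value <= threshold else right_label


def describe(stump):
    """Render the stump as a human-readable IF/ELSE rule."""
    _, threshold, left_label, right_label = stump
    return (f"IF lost_share <= {threshold:.3f} THEN {left_label} "
            f"ELSE {right_label}")


if __name__ == "__main__":
    stump = fit_stump(STUDENTS)
    print(describe(stump))
    print("training errors:", stump[0])
```

A sub-symbolic model trained on the same toy data might reach similar accuracy, but it would not yield a rule of this form, which is the distinction the paragraph above draws.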

Dekker, Pechenizkiy, and Vleeshouwers (2009) used decision trees, Bayesian networks, logistic regression, and rule-based methods to predict dropout of freshmen at the Electrical Engineering department at Eindhoven Technical University. They considered both pre-entry data and data from the first year of study and created models (1) from pre-entry data only, (2) from study data only, and (3) from all data. Nicholes and Reimer used logistic regression to analyze the impact of the first-year composition on student persistence. They explored the pathway through composition-1 and composition-2 courses to graduation with the aim of early recognition of students needing some support. The data were collected at one university in the US (Nicholes & Reimer, 2020). Rovira, Puertas, and Igual (2017) used different machine learning techniques for the early prediction of student dropout as well as for the prediction of grades. For the first task, classification, the authors used logistic regression, a naive Bayes classifier, SVM, random forest, and AdaBoost. For the second task, prediction, linear regression and support vector regression were used. The data collected at the University of Barcelona consist of grades in all courses taken by a student. Abu-Oda and El-Halees analyzed data about grades in particular courses of bachelor-degree computer science students at Alaqsa University. They used these data to classify dropouts using decision trees and a naive Bayes classifier and to find relationships between grades using association rules mining (Abu-Oda & El-Halees, 2015). Pal and Pal used decision trees to analyze data about socio-demographic characteristics of students at Veer Bahadur Singh Purvanchal University in Jaunpur, India. They tried to predict the students' performance with the aim of identifying students who need special advising or counseling (Pal & Pal, 2013). Montmarquette, Mahseredjian, and Houle (2001) describe a bivariate probit model created to differentiate between persistence and dropout using longitudinal data about enrolment of students at the University of Montreal. Aulck et al. (2019) applied logistic regression, k-nearest neighbors, random forests, SVM, and gradient boosted trees to predict graduation or non-completion of students using data from the University of Washington.

Our research fits into the data-driven (i.e. “British”) approach. The data used in our work come from one Czech university (University). In our previous work, we focused on data known about the students at the time of their admission (Berka, Marek, & Vrabec, 2019). In the work reported in this paper, we also analyze the data about the students' study achievements collected on a semester basis. We aim to differentiate between bachelor's students who successfully completed their study at the University and those who did not. We used decision trees, random forests, and logistic regression to predict who will finish their studies successfully and who will not (either being dismissed or leaving by his/her own choice). We also performed association rules mining to find interesting characteristics of students who did not complete the first semester/year.

In the rest of the paper, we describe the data used, define the data mining tasks, describe the experiments we carried out, discuss the results, and show future research directions. As we follow the CRISP-DM methodology (Chapman et al., 2000) in our work, we organize the main part of the paper according to its basic steps: business understanding, data understanding, data preparation, and modeling.

Section snippets

Business understanding

The University offers bachelor, master, Ph.D., and MBA study programs oriented mainly toward economics, but also offers several programs in computer science, quantitative methods, and foreign affairs. The University has about 14,000 students and about 550 teaching staff. The academic year at the University starts on September 1st and is organized in two semesters, a winter semester and a summer semester. The standard duration of study toward the bachelor's degree is three years, the standard

Results and their evaluation

To address Research Question 1, we created tree-based classification models and compared them (in terms of classification accuracy) with logistic regression and with two models that reflect the quality of the data used. Research Question 2 is addressed by dependency analysis using association rules.
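As a rough illustration of what dependency analysis with association rules measures, the sketch below computes support, confidence, and lift for a single rule in plain Python. The records, the attribute names `gap` and `failed_first_semester`, and the function `rule_measures` are hypothetical and invented for this example; they are not the authors' data, and the paper's rules may be of different types and may use other interest measures. The toy rule only echoes the abstract's finding that a time gap between secondary and tertiary education is associated with failure in the first semester.

```python
# Hypothetical pre-entry records; attribute names and values are invented.
RECORDS = [
    {"gap": "yes", "failed_first_semester": True},
    {"gap": "yes", "failed_first_semester": True},
    {"gap": "yes", "failed_first_semester": True},
    {"gap": "yes", "failed_first_semester": False},
    {"gap": "no", "failed_first_semester": True},
    {"gap": "no", "failed_first_semester": False},
    {"gap": "no", "failed_first_semester": False},
    {"gap": "no", "failed_first_semester": False},
    {"gap": "no", "failed_first_semester": False},
    {"gap": "no", "failed_first_semester": False},
]


def matches(record, condition):
    """True if the record has every attribute value listed in the condition."""
    return all(record.get(key) == value for key, value in condition.items())


def rule_measures(records, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent.

    Both sides of the rule are dicts mapping attribute names to required values.
    """
    n = len(records)
    n_ante = sum(matches(r, antecedent) for r in records)
    n_cons = sum(matches(r, consequent) for r in records)
    n_both = sum(
        matches(r, antecedent) and matches(r, consequent) for r in records
    )
    support = n_both / n                              # P(antecedent and consequent)
    confidence = n_both / n_ante if n_ante else 0.0   # P(consequent | antecedent)
    base_rate = n_cons / n                            # P(consequent)
    lift = confidence / base_rate if base_rate else 0.0
    return support, confidence, lift


if __name__ == "__main__":
    support, confidence, lift = rule_measures(
        RECORDS, {"gap": "yes"}, {"failed_first_semester": True}
    )
    print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the consequent is more frequent among records matching the antecedent than in the data overall, which is the sense in which a factor "increases the probability" of failure.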

Discussion

As our work fits into the data-driven “British” approach, we are limited by the available registry data that monitor mainly the study progress. Yet we used as many pre-entry attributes described in the dropout theories as possible. Since we focused on interpretable models, we could analyze which of these attributes contribute to the dropout. The results of both classification and dependency analysis can thus be used to characterize students who will successfully end their bachelor's study and

Conclusions and future research

We report some results of a data mining analysis of the data about students who enrolled in their bachelor study at the University in the academic year 2013/2014. We have formulated two research questions that can be turned into classification and dependency analysis tasks. The data used for classification consist of pre-entry attributes and data about study achievements in the first four semesters; the data used for dependency analysis consist only of pre-entry attributes. We argue in favor of

Acknowledgements

This paper was prepared with the contribution of long-term institutional support of research activities by the Faculty of Informatics and Statistics, Prague University of Economics and Business.

References (25)

C. Montmarquette et al. (2001). The determinants of university dropouts: A bivariate model with sample prediction. Economics of Education Review.
G.S. Abu-Oda et al. (2015). Data mining in higher education: University student dropout case study. International Journal of Data Mining & Knowledge Management Process.
R. Agrawal et al. Mining associations between sets of items in massive databases.
O. Aljohani (2016). A comprehensive review of the major studies and theoretical models of student retention in higher education. Higher Education Studies.
L. Aulck et al. (2019). Mining university registrar records to predict first-year undergraduate attrition. Proc. 12th Int. Conf. on Educational Data Mining.
P. Berka et al. (2019). Modeling Students Dropout Using Statistical and Data Mining Methods. Proc. 22th. Conf. Applications of Mathematics and Statistics in Economics.
J. Bean (1983). The application of a model of turnover in work organizations to the student attrition process. Review of Higher Education.
L. Breiman (2001). Random forests. Machine Learning.
L. Breiman et al. (1984). Classification and regression trees.
P. Chapman et al. (2000). CRISP-DM 1.0 Step-by-step data mining guide.
G.W. Dekker et al. (2009). Predicting students drop out: A case study. International Conference on Educational Data Mining.
J. Fischer et al. (2016). Eurostudent VI. Základní výsledky šetření postojů a životních podmínek studentů vysokých škol v České republice.


