A Bayesian nonparametric approach for the analysis of multiple categorical item responses

https://doi.org/10.1016/j.jspi.2014.07.004

Highlights

  • Joint factor and cluster analysis of questionnaires with multiple categorical responses.

  • Joint inference on the number of factors and clustering of subjects.

  • Clustering borrows strength across subjects, improving estimation of the model parameters.

  • We employ Markov chain Monte Carlo techniques, including sampling of missing data.

  • Application to educational datasets uncovers hidden relationships between questions and educational concepts.

Abstract

We develop a modeling framework for joint factor and cluster analysis of datasets where multiple categorical response items are collected on a heterogeneous population of individuals. We introduce a latent factor multinomial probit model and employ prior constructions that allow inference on the number of factors as well as clustering of the subjects into homogeneous groups according to their relevant factors. Clustering, in particular, allows us to borrow strength across subjects, therefore helping in the estimation of the model parameters, particularly when the number of observations is small. We employ Markov chain Monte Carlo techniques and obtain tractable posterior inference for our objectives, including sampling of missing data. We demonstrate the effectiveness of our method on simulated data. We also analyze two real-world educational datasets and show that our method outperforms state-of-the-art methods. In the analysis of the real-world data, we uncover hidden relationships between the questions and the underlying educational concepts, while simultaneously partitioning the students into groups of similar educational mastery.

Introduction

In this paper, we develop a Bayesian nonparametric model for the joint factor and cluster analysis of datasets where multiple categorical response items are collected on a heterogeneous population of individuals. As in conventional Bayesian probit and multinomial regression models (Albert and Chib, 1993), we assume that each categorical response outcome is a surrogate for a continuous unobserved latent variable. A Bayesian factor model is then assumed on the latent variables. In contrast with common factor analysis and multidimensional item response theory (Reckase, 2009) approaches, we allow the number of underlying factors to be inferred directly from the data. Our approach is similar to that of Rai and Daumé III (2008) and Knowles and Ghahramani (2011), who consider a nonparametric prior on the number of latent concepts based on the Indian Buffet Process (IBP) proposed by Griffiths and Ghahramani (2005). In addition, we employ a Dirichlet Process prior (Ferguson, 1973, 1974) to cluster subjects into groups characterized by similar factor structures. Clustering allows us to borrow strength across subjects, therefore helping in the estimation of the model parameters, particularly when the number of observations is small. We also discuss mechanisms for the imputation of missing data. We employ computationally efficient Markov chain Monte Carlo (MCMC) methods to provide tractable inference for the model parameters of interest.
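As a concrete, if simplified, illustration of the model components just described, the following Python sketch forward-simulates binary question responses from a latent-factor probit model with sparse question-concept loadings and learners clustered into groups that share concept-knowledge profiles. All variable names, dimensions, and distributional choices are our own illustrative assumptions, not the paper's exact specification.

    # Illustrative forward simulation (not the paper's exact formulation).
    import numpy as np

    rng = np.random.default_rng(0)
    N, Q, K, L = 200, 30, 4, 3          # learners, questions, latent concepts, clusters

    # Sparse question-concept loadings (a fixed-K stand-in for an IBP-style sparse matrix).
    Z = rng.binomial(1, 0.3, size=(Q, K))         # which concepts a question involves
    A = Z * rng.normal(0.0, 1.0, size=(Q, K))     # loading strengths
    mu = rng.normal(0.0, 1.0, size=Q)             # question-specific intercepts (difficulty)

    # Cluster-level concept-knowledge profiles; each learner inherits its cluster's profile.
    cluster_means = rng.normal(0.0, 1.0, size=(L, K))
    labels = rng.integers(0, L, size=N)           # cluster membership of each learner
    C = cluster_means[labels] + 0.1 * rng.normal(size=(N, K))

    # Latent continuous responses and observed binary outcomes through a probit link.
    tau = rng.gamma(2.0, 1.0, size=N)             # subject-specific precisions
    Y = C @ A.T + mu + rng.normal(size=(N, Q)) / np.sqrt(tau)[:, None]
    W = (Y > 0).astype(int)                       # 1 = correct answer, 0 = incorrect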

Surveys and questionnaires with ordinal categorical responses are employed in many fields to gather relevant feedback information on individual attitudes toward a set of items. For example, in marketing, surveys are used to improve product delivery and pricing against competition. Here, we consider a specific application to personalized learning, which has recently emerged as an independent research topic within the field of education (Stamper et al., 2007, Li et al., 2011, Murray et al., 2004). Our model leverages the fact that knowledge in a given subject can typically be decomposed into a set of potential principles to learn, termed concepts. For personalized learning, in particular, statistical methods are widely employed to enhance student learning in a course, namely by assessing how well students understand educational concepts (learning analytics), and exploring the relationships between the test questions and the concepts (content analytics). Rigorous statistical methods for both learning and content analytics enable targeted feedback to learners, their instructors, and the content authors (Kulik, 1994).

Given the number of individuals typically surveyed and the number of topics assessed per individual, it is often of interest to reduce the dataset to an interpretable set of highly informative variables. For example, in assessing tests or homework questions, a few skills or factors may play a role in understanding why certain learners succeed at some problems while failing at others. In turn, this information may be useful for predicting future learner outcomes as well as diagnosing learner misconceptions. Traditionally, Item Response Theory (IRT) methods have been used to relate the individual responses to a set of latent traits, which summarize the non-observable characteristics of the person. However, many commonly used IRT approaches rely on the simplifying assumption that the relationship between each latent trait and the probability of a correct response to a test item can be represented as a continuous mathematical function of a single parameter or a limited set of parameters (Reckase, 2009). For example, the popular Rasch model can be described as a logistic model with two types of parameters, characterizing users and items, respectively (Rasch, 1993). While this model works satisfactorily if the set of items is restricted to a limited domain, its performance suffers when items of mixed type are introduced, such as test questions that span multiple academic disciplines.
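For concreteness, one common parameterization of the Rasch item response function expresses the probability of a correct response as a logistic function of the difference between a learner ability and an item difficulty. The minimal sketch below, with illustrative variable names, shows the single-function assumption referred to above.

    import numpy as np

    def rasch_prob(theta_i: float, b_j: float) -> float:
        """P(learner i answers item j correctly) = logistic(theta_i - b_j)."""
        return 1.0 / (1.0 + np.exp(-(theta_i - b_j)))

    print(rasch_prob(theta_i=1.0, b_j=0.5))   # ability above difficulty -> about 0.62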

The Bayesian modeling approach we propose allows increased flexibility with respect to current methods for analyzing educational data. In particular, we obtain joint estimation of (i) associations among questions and concepts, (ii) learner concept knowledge profiles, and (iii) underlying question difficulties. Current methods for analyzing educational data typically perform factor and cluster analyses separately, either to highlight different structures in the data or as part of two-step procedures. We show that performing factor analysis while clustering the population of interest into groups of individuals characterized by homogeneous patterns of underlying factors (i.e., groups of learners with comparable skill sets) improves the predictive performance of the model. Moreover, models for educational data commonly assume that all subjects are equally reliable (i.e., that two students with the same concept mastery exhibit the same variability when answering questions). In contrast, by including a subject-specific precision parameter, we obtain a more realistic representation of a student’s ability and improve the interpretability of the results. Another key aspect of our model is its flexibility in inferring the number of concepts from the data itself, an aspect previously unexplored in the literature on educational data. Finally, missing values are readily handled within our Bayesian paradigm. This allows us, for instance, to impute whether a learner would answer an unattempted question correctly or not.
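To illustrate the role of a subject-specific precision parameter, the sketch below uses assumed notation (m is the latent mean implied by a learner's concept mastery, tau_i the learner's precision) to show how two learners with identical mastery can have different probabilities of answering correctly under a probit link.

    import numpy as np
    from scipy.stats import norm

    def prob_correct(m, tau_i):
        """P(W = 1) = Phi(sqrt(tau_i) * m): probit response with precision tau_i."""
        return norm.cdf(np.sqrt(tau_i) * m)

    print(prob_correct(m=0.5, tau_i=4.0))    # high-precision learner -> about 0.84
    print(prob_correct(m=0.5, tau_i=0.25))   # low-precision learner, same mastery -> about 0.60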

The remainder of the paper is organized as follows. Details regarding the fully Bayesian model and prior distributions are given in Section 2. Section 3 presents our MCMC method for posterior inference and analysis. Section 4 presents the applications, including a simulation study and results from experimental data. Section 5 provides some concluding remarks. The appendix contains technical details regarding our implementation.

Section snippets

Hierarchical Bayes model

In this section, we develop a modeling framework for joint factor and cluster analysis of datasets where multiple categorical response items are collected on a heterogeneous population of individuals. We start by introducing a latent factor multinomial probit model. Then, we discuss prior constructions that allow inference on the number of factors as well as the clustering of subjects into homogeneous groups according to their relevant factors. We also discuss prior distributions for the other model parameters

Posterior inference

In this section we briefly describe the sampling algorithm for posterior inference, then discuss identifiability issues and ways to obtain posterior estimates of the parameters of interest.
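Although the paper's exact updates are detailed in the appendix, the probit data augmentation it builds on (Albert and Chib, 1993) has a standard form: conditional on the binary response, the latent variable is drawn from a normal distribution truncated to the positive or negative half-line. A minimal sketch, with illustrative names and a unit noise scale assumed for simplicity:

    import numpy as np
    from scipy.stats import truncnorm

    def sample_latent(m, W, scale=1.0, rng=None):
        """Draw Y | W, m elementwise from the appropriately truncated normal."""
        rng = np.random.default_rng() if rng is None else rng
        lower = np.where(W == 1, 0.0, -np.inf)    # Y > 0 when the answer is correct
        upper = np.where(W == 1, np.inf, 0.0)     # Y < 0 when the answer is incorrect
        a, b = (lower - m) / scale, (upper - m) / scale
        return truncnorm.rvs(a, b, loc=m, scale=scale, random_state=rng)

    Y = sample_latent(m=np.array([0.3, -1.2]), W=np.array([1, 0]))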

Experiments

Here we assess the performance of our approach on simulated data as well as on real-world educational datasets.

Conclusions

We have proposed a Bayesian nonparametric model for the joint factor and cluster analysis of datasets where multiple categorical response items are collected on a heterogeneous population. Our fully Bayesian method employs two nonparametric priors, one for learning the number K of latent variables and one for learning the number L of subject clusters from the data. By means of simulations, we have shown that the additional structure imposed by our model provides improved accuracy with respect

MCMC details

We provide details of the MCMC algorithm for our Bayesian infinite factor model. Given the observations, W, we obtain inference for the parameters of interest using a combination of Gibbs sampling and Metropolis–Hastings updates.

  • 1.

    Update for W: We need to include possible missing values in W. Let $\tilde{W}_{ij}$ represent a missing answer for learner i at question j, with a corresponding latent variable $\tilde{Y}_{ij}$. Then the likelihood can be split into observed and unobserved data, $p(Y) = \prod_{i,j \in \Omega_{\mathrm{obs}}} \mathrm{Bern}\big(W_{ij};\, \Phi(Y_{ij}; 0, \ldots$ A hedged sketch of this imputation step is given below.
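Under the split of the likelihood into observed and unobserved entries sketched above, a missing response imposes no truncation on its latent variable, so it can be imputed by an untruncated normal draw followed by thresholding through the probit link. A minimal sketch under these assumptions, with illustrative variable names:

    import numpy as np

    def impute_missing(m, scale=1.0, rng=None):
        """Draw the latent Y~_ij and the implied missing answer W~_ij at an unobserved entry."""
        rng = np.random.default_rng() if rng is None else rng
        y_tilde = rng.normal(loc=m, scale=scale)   # no truncation: the answer was not observed
        return y_tilde, int(y_tilde > 0)           # imputed answer is correct iff Y~ > 0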

Acknowledgments

This work was supported by the National Science Foundation under Cyberlearning grant IIS-1124535, the Air Force Office of Scientific Research under grant FA9550-09-1-0432, and the Google Faculty Research Award program.

The authors would like to express their gratitude to the Chairman, JAC, IISER Pune, for sharing the educational data, as well as Divyanshu Vats for insightful discussion regarding this dataset.

References (40)

  • T. Ferguson

    Bayesian density estimation by mixtures of normal distributions

  • L.M. Reder

    Strategy selection in question answering

    Cogn. Psychol.

    (1987)
  • S.V. Stehman

    Selecting and interpreting measures of thematic classification accuracy

    Remote Sens. Environ.

    (1997)
  • G. Adomavicius et al.

    Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions

    IEEE Trans. Knowl. Data Eng.

    (2005)
  • J.H. Albert et al.

    Bayesian analysis of binary and polychotomous response data

    J. Amer. Statist. Assoc.

    (1993)
  • Ando, T., Bai, J., 2014. Asset pricing with a general multifactor structure. J. Finance Econom.
  • D. Blackwell et al.

    Ferguson distributions via Pólya urn schemes

    Ann. Statist.

    (1973)
  • C.M. Carvalho et al.

    High-dimensional sparse factor modeling: applications in gene expression genomics

    J. Amer. Statist. Assoc.

    (2008)
  • E. Wang et al.

    Spatio-temporal modeling of legislation and votes

    Bayesian Anal.

    (2013)
  • M. Escobar et al.

    Bayesian density estimation and inference using mixtures

    J. Amer. Statist. Assoc.

    (1995)
  • T.S. Ferguson

    A Bayesian analysis of some nonparametric problems

    Ann. Statist.

    (1973)
  • T. Ferguson

    Prior distributions on spaces of probability measures

    Ann. Statist.

    (1974)
  • Z. Ghahramani et al.

    Bayesian nonparametric latent feature models

  • T. Griffiths et al.

    Infinite latent feature models and the Indian buffet process. Technical Report GCNU TR 2005-001

    (2005)
  • P.R. Hahn et al.

    A sparse factor analytic probit model for congressional voting patterns

    J. R. Stat. Soc. Ser. C Appl. Stat.

    (2012)
  • R. Henao et al.

    Bayesian sparse factor models and DAGs inference and comparison

  • V. Johnson et al.

    Ordinal Data Modeling

    (1999)
  • D. Knowles et al.

    Nonparametric Bayesian sparse factor models with application to gene expression modeling

    Ann. Appl. Stat.

    (2011)
  • A. Koriat et al.

    The combined contributions of the cue-familiarity and accessibility heuristics to feelings of knowing.

    J. Exp. Psychol. Learn. Mem. Cognit.

    (2001)
  • J.A. Kulik

    Meta-analytic studies of findings on computer-based instruction
