Introduction

Many countries around the globe have adopted a dual vocational education and training structure (e.g. Australia, Austria, Belgium, Canada, Denmark, Finland, France, Germany, Ireland, the Netherlands, Portugal, Sweden, Switzerland). The defining feature of this kind of vocational education and training (VET) is the idea of equipping apprentices with both practical and theoretical knowledge by combining company-based training programs provided by the private sector with a school-based component, usually provided by the public sector, leading to qualifications in nationally recognized occupations. Advocates claim that this kind of VET structure is particularly well aligned with the needs of the labor market (Krekel and Walden 2016). Some studies even claim positive effects on youth unemployment rates, as well as on economic competitiveness, growth and stability (e.g. EU 2018; Hanhart and Bossio 1998; Hanushek 2012; McIntosh 1999).

However, despite their perceived importance for education and economic policy, VET actors internationally struggle with issues of quality assurance and quality improvement (Le Mouillour 2017). For instance, Negrini et al. (2015) have shown that training conditions vary widely across industries, companies and professions. Different training conditions might have lasting effects on VET outputs such as the acquisition of vocational competence, professional and personal development processes and, finally, the social integration of young adults. Research into the quality assurance of learning conditions might therefore be the most important factor for the successful implementation of dual VET, and a prerequisite to achieving its claimed benefits at the individual and economic levels. However, studies addressing VET learning conditions frequently reveal alarming results, especially in regard to the in-company training component: insufficient mentoring and instruction, insufficient feedback, inadequate cooperation between the learning venues, or a lack of equipment (e.g. Ebbinghaus et al. 2010; Virtanen and Tynjälä 2008) are just some of the issues raised. This is problematic because the workplace plays a crucial role in dual vocational training. Moreover, for many apprentices it represents their first long-term contact with the working world (Rausch and Schley 2015).

Against this backdrop, there is growing interest in the identification and evaluation of key characteristics of in-company training conditions. To gain insights into training companies, many researchers use a questionnaire design – despite the related disadvantages of retrospective surveys (Rausch 2013; Tourangeau 2000). In the vast majority of cases the focus is on the learner’s perspective, and apprentice surveys are used (Böhn and Deutscher 2019; Rausch 2012). As Tynjälä (2013, p. 15) states, this is in line with a constructivist view of learning, where ‘(…) the presage factors do not affect the learning process directly(, but) rather through the learner’s interpretation (e.g. Prosser and Trigwell 1999; von Glasersfeld 1995)’.

However, the findings of these different studies are only partially comparable, which impedes aggregation. A key problem is that the many different features of in-company training conditions are operationalized, and priorities set, in different ways. Various test instruments for the operationalization of training conditions in the VET system do exist, but unfortunately they are only rarely coordinated with one another. In recent years, a large number of test instruments have been designed, redesigned and adapted. However, there is still a lack of adequate test instruments that would enable the reliable measurement of training conditions in VET (Velten and Schnitzler 2012). Some of these newer test instruments are derived from instruments used to measure workplace learning conditions in an employee context: for instance, the Job Diagnostic Survey (JDS) of Hackman and Oldham (1974) or the Work Design Questionnaire (WDQ) of Morgeson and Humphrey (2006). Others were designed especially for the VET context – for instance, the MIZEBA of Zimmermann et al. (1994) or the IBAQ of Velten and Schnitzler (2012). Existing test instruments differ in length and level of detail, but most importantly with regard to their respective content. Moreover, it is particularly problematic that studies only rarely report item and scale analysis results. Hence, the measurability, verifiability and ultimately the comparability of findings regarding in-company training conditions are impeded.

This context is the starting point for the research presented here. The aim of our research was to assemble and organize all existing test instruments in the VET context, to give an overview of the breadth of VET learning quality research and, on the basis of this overview, to develop a validated questionnaire to measure the workplace characteristics of dual VET. The VET learning quality inventory (VET-LQI) reflects the substantive heterogeneity of existing test instruments across various content areas in dual VET contexts, and offers validated single items and short scales for these content areas. The distinctive feature of the test instrument presented here is that it summarizes and reflects previous research efforts and can therefore be considered a rich potential source of information for measuring the quality of VET learning conditions. By offering short scales for diverse content areas, it could help future researchers to analyze in-company training conditions more comprehensively within limited testing time, and to integrate and compare their results with respect to a general theoretical framework (Tynjälä 2013).

In order to realize this endeavor, we took a two-step approach: First, we collected and categorized all existing apprentice survey instruments in the area of VET learning quality and, through a qualitative meta-synthesis, generated an overview of the state of operationalization in VET learning quality research (study 1). Second, we used the synthesized item pool to extract and test short scales for all identified content areas of VET quality research in a German context (study 2).

Theoretical Model: Quality of Workplace Learning in the Dual VET Context

In recent years, a growing research effort has been put into the identification and description of central quality aspects of learning in VET. The term ‘quality’ in the context of workplace learning in VET is, however, characterized by a lack of conceptual clarity (e.g. Van den Berghe 1997). Harvey and Green (1993) state that the conception of quality depends on perspective and usage, and that the term can be used to describe processes as well as to evaluate results. Moreover, quality always implies a comparison between actual conditions and the normative expectations of interest groups. It can then be interpreted as the extent of goal achievement (Mirbach 2009; Ott and Scheib 2002). From this point of view, ‘quality’ can in principle be operationalized. But as goal orientations vary between interest groups, reference to the respective goal perspective is necessary, which in turn restricts the scope of interpretation.

Synthesizing the current state of research according to Klotz et al. (2017), we propose to define VET quality as the subjective perception of characteristics of vocational training that are conducive to certain outcomes. At this point, the key characteristics of Tynjälä’s (2013) 3-p model – a general model to describe learning processes – are drawn upon: While different conceptions of the term prevail (e.g. Blom and Meyers 2003; Harvey and Green 1993), a broad consensus has emerged in the workplace learning community in regard to using a ‘three pillar approach’ when describing in-company vocational training quality aspects (e.g. Seyfried et al. 2000; Tynjälä 2013; Visser 1994). For instance, Tynjälä (2013) refers to Biggs’ (1999) 3-p model by distinguishing presage, process and product factors, and combines this approach with a strong emphasis on the individual’s – in this case the learner’s – perspective (Fig. 1): The learner factors, as well as the learning context, are represented within the presage (input) dimension. The process dimension covers the description of workplace learning characteristics, including the structure and performance of work tasks or the learning individual’s interaction with others. Finally, within the product (output) dimension, all learning-related outcomes are summarized; these are mainly focused on the individual’s personal and professional development (Tynjälä 2013).

Fig. 1
figure 1

The 3-p model of workplace learning (Tynjälä 2013, modified from Biggs 1999)

While there is broad consensus about the distinction between input, process and output dimensions in general, the key characteristics specifying those dimensions vary greatly across existing studies. The absence of a common understanding of the specific content structure of input, process and output factors leads to a lack of conceptual clarity regarding VET quality, and is reflected inter alia by the large number of test instruments that have been developed in this context with little reference to each other.

As noted above, it was the aim of our research to assemble and organize all existing test instruments in the VET context and to develop a questionnaire measuring workplace characteristics in dual VET contexts. This focus is due to the fact that dual VET has some special characteristics compared to workplace learning in general (Brooker and Butler 1997; Fuller and Unwin 2003; Raemdonck et al. 2014; Virtanen et al. 2009; Virtanen and Tynjälä 2008), especially in regard to three features: the role of the vocational school, curriculum-based work tasks, and training personnel.

Usually, dual VET system conditions are governed by a legal framework. On the basis of this legal framework, learning is designed as a cooperative endeavor of vocational schools and private or public sector companies. As two organizational actors are integrated in the training process, there is a strong need for collaboration and coordination between the two learning venues. Hence, many apprentice surveys focus on aspects of learning venue cooperation (e.g. Brooker and Butler 1997; Dwyer et al. 1999; Ebner 1997; Feller 1995; Fink 2015; Heinemann et al. 2009; Keck et al. 1997; Nickolaus et al. 2015; Prenzel et al. 1996; Ulrich and Tuschke 1995; Virtanen et al. 2014; Walker et al. 2012). A second characteristic of dual VET systems relates to vocational curricula. Workplace learning is generally described as being mostly informal (Brooker and Butler 1997; Eraut 2004a, b; Marsick and Watkins 1990; Virtanen et al. 2009). In VET, on the other hand, the learning process of an apprentice at work is normally accompanied by a curriculum that structures a domain into work tasks that have to be assigned to apprentices in order to develop vocational competence. As a third characteristic, dual VET is typically supported by training personnel who give instructions and ensure a certain level of formal organization. Further, because a formal occupational qualification is to be acquired, the performance of an apprentice needs to be assessed from time to time (Virtanen et al. 2009). This is why the design of work tasks and the role of the training personnel are both of high importance for in-company vocational training (Brooker and Butler 1997; Fuller and Unwin 2003; Virtanen et al. 2009). All three characteristics underline that workplace learning in the context of dual VET reflects a certain degree of formalization. Against this backdrop, apprentice surveys often focus strongly on questions related to those aspects (e.g. Beicht et al. 2009; Brooker and Butler 1997; DGB 2008; Ebner 1997; Ernst 2016; Feller 1995; Gebhardt et al. 2014; Heinemann et al. 2009; Hofmann et al. 2014; Keck et al. 1997; Koch 2016; Nickolaus et al. 2015; Prenzel et al. 1996; Velten and Schnitzler 2012; Virtanen et al. 2014; Zimmermann et al. 1994).

As the goal of our research was to develop a comprehensive apprentice survey in the context of dual VET, we focused on instruments used in this context – though there is of course a significant thematic overlap with general workplace learning instruments. Further, our study is limited to learning at the workplace (in-company characteristics) rather than school-specific quality characteristics.

Study 1: Overview of Previous Apprentice Surveys and their Validity

Method

Given the large number of studies in this research area and their heterogeneous designs and conceptions of quality, a qualitative meta-synthesis (sometimes also referred to as qualitative meta-analysis) seemed particularly suitable for collecting and categorizing all existing apprentice survey instruments in the area of VET learning quality, as it allows the systematic and full integration of research results (Paterson 2012). Qualitative meta-synthesis does, however, represent a rarely chosen form of analysis (Eisend 2014; Fricke and Treinies 1985; Glass et al. 1982). Following Lipsey and Wilson (2001), as in a quantitative meta-analysis, one integrates the findings of several studies to develop an integrative overall result. In this case, however, the database was qualitative in nature: for our study, it consisted of the test items used in former apprentice surveys. In accordance with the process logic of meta-synthesis (e.g. Jensen and Allen 1996), the first two steps consist of (1) literature research and (2) literature selection. In our case a third step had to be added, concerning the (3) collection of test instruments. The core of a qualitative meta-synthesis then consists of the inductive determination of categories: (4) item analysis and categorization. Finally, an integrative model of categories was built as the result of this methodological process.

Literature Research

For the VET context, a review of studies using apprentice surveys was conducted (Böhn and Deutscher 2019). This collection of studies, items and scales served as a starting point for the development of the VET-LQI (see study 2). In order to summarize the state of research into VET quality in apprentice surveys, we started out with a systematic literature search in eight databases (Business Source Premier, Deutscher Bildungsserver, EconLit, Education Resources Information Center (ERIC), Fachportal Pädagogik, Literaturdatenbank berufliche Bildung (LDBB), Social Sciences Citation Index (SSCI), Taylor & Francis), not differentiating in regard to publication type, profession, industry, language or country. The literature search included 21 German (Ausbildungsabbruch, Ausbildungsqualität, Ausbildungszufriedenheit, Berufsausbildung, betriebliche Ausbildung, betriebliche Ausbildungssituation, betriebliche Lernaufgaben, duale Ausbildung, duales System, Lehrling, Befragung, Fragebogen, Inventar) and English (Apprenticeship, On the job learning, VET, Vocational Education and Training, Workplace Learning, Work-based Vocational Training, Questionnaire, Survey) search terms, which were combined in different ways. Thus, more than 13,000 search results were generated (including duplicates).

Literature Selection

After a detailed review and evaluation of the search results, all studies were eliminated that (1) were only theoretically or conceptually founded, or used qualitative or quantitative measurements that included no written survey (e.g. observation studies or interviews), (2) aimed at generating general assessments regarding the dual VET system without referring to the apprenticeship of the individual being questioned, (3) dealt with apprenticeship models that are not part of the dual system – meaning apprenticeships that do not integrate both a classroom-based and a workplace-based component and that therefore do not necessarily include practical experiences (including instruction and mentoring within a company), (4) focused solely on the classroom-based instead of the workplace-based component of the apprenticeship, (5) focused on a point of view other than the apprentices’ – for instance, the perspective of training personnel or vocational teachers, or (6) were written neither in English nor in German. On the basis of this literature search, in combination with these selection criteria, 89 studies were deemed relevant. By retracing their references, an additional 23 studies were added.

Collection of Test Instruments

For a number of studies, the underlying test instrument was neither included in an appendix nor retrievable from supplemental materials or online resources. In such cases, we wrote to the authors asking for the test instruments to be provided. After a return period of eight weeks (return rate: 33.3%), we had test instruments for 63.1% of all the studies. All test instruments that were still unavailable after the expiry of the return period were not considered in the analysis. The same applies to questionnaires that were incomplete or in a language other than English or German. In preliminary work for the analysis of test instruments, those questionnaires were identified that had been used in more than one study. This was necessary to avoid multiple analyses, which would have distorted the results. Finally, the literature research, selection and acquisition of test instruments provided the following results: (1) 112 studies were substantively relevant, but the underlying test instrument was available for only 71 of these. (2) One test instrument was available in two different versions, which were analyzed separately. (3) One study used a test instrument in a language other than English or German and was therefore not considered in the analysis. (4) In the case of three studies, only incomplete versions of the test instrument were provided, so they could not be considered in the analysis. (5) In the case of 37 studies we had no access to the test instruments at the expiry of the return period. (6) Of those studies that provided the test instrument, fourteen were in English, the majority in German. (7) After checking for multiple uses, 43 different test instruments could be extracted.Footnote 1

Item Analysis and Categorization

The inductive determination of categories was conducted on the basis of Mayring’s (2004) qualitative content analysis, including (1) generalization, (2) selection and (3) bundling, in order to connect and synthesize the test instrument items. First, all items were collected in tabular form, excluding those questions dealing only with school-based aspects of VET. Every single item was categorized separately on the basis of its specific content and classified within a categorical system. To this end, items with identical or similar contents were grouped: for example, the items ‘Have you previously held a full-time job?’ (NCVER 2000), ‘Have you previously undertaken a pre-apprenticeship in the same industry area?’ (Walker et al. 2012) or ‘Do you have previous work experience?’ (Virtanen and Tynjälä 2008) were grouped under the keyword ‘personal background’. Subsequently, keywords were merged into categories, allowing a reasonable summary of contents. The grouping of items and the categorical summary were independent of positive or negative wording and of different answer scales. It must be noted that the grouping of content-related items corresponded only to some extent to the descriptions and summaries within the scales of the original test instruments. Compared to the original studies, the results of this analysis therefore deviate in part, either in the grouping of certain items into specific categories or in the naming of certain categories. A codebook for the keywords and categories was developed. The entire analysis was carried out twice, by one of the two researchers and in tandem with a third person (intracoder reliability = .984, intercoder reliability = .926).
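For illustration, a reliability check of this kind might be computed along the following lines in R. The coding vectors are invented, and the use of simple percentage agreement is an assumption, as the exact coefficient behind the reported values is not specified here.

```r
# Minimal sketch of an intercoder reliability check, assuming simple
# percentage agreement; 'coder_a' and 'coder_b' are hypothetical vectors
# holding the category each coder assigned to the same set of items.
coder_a <- c("personal background", "mentoring", "overload", "mentoring")
coder_b <- c("personal background", "mentoring", "overload", "feedback")

agreement <- mean(coder_a == coder_b)  # share of identically coded items
print(agreement)                       # 0.75 for this toy data

# A chance-corrected alternative would be Cohen's kappa, e.g. via
# psych::cohen.kappa(cbind(coder_a, coder_b))
```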

Results

Via this method, a total of 3356 items from the 43 test instruments were analyzed and classified into 30 categories within eight content groups. As we focused on workplace characteristics, a reduced version of the category system (covering 2343 items) was used (Fig. 2) – a supplement to Tynjälä’s (2013) 3-p approach (Böhn and Deutscher 2019).

Fig. 2
figure 2

Integrative category system based on a qualitative meta-synthesis of test instruments used in apprentice surveys in the context of dual VET

The results of this qualitative meta-synthesis served as a starting point for the subsequent analysis and the development of an integrative test instrument. As the aim was to reflect prior research efforts, the user frequency of items and categories was determined first – showing how many test instruments traced back to a certain category, and how many items could be assigned to each category – thereby referring to and extending the results of the qualitative analysis of test instruments by Böhn and Deutscher (2019). As a result, substantial differences regarding the nominal and substantive design of the 43 test instruments were revealed (Table 1).

Table 1 User frequency and number of items per category, content area and dimension

It can be noted that, on average, less than 50% of all test instruments drew on categories in the content area ‘learning environment’. Those that did so mainly focused on the category ‘usefulness of learning venue cooperation’; slightly more than half of all test instruments covered the ‘vocational training framework’. There were big differences in the usage of categories from the content area ‘work tasks’, with a focus on ‘overload’. Only about one third of all test instruments used questions related to the content area ‘social interaction’, with the focus being on the apprentice’s ‘functional involvement’. Test instruments reflected the content area ‘pedagogical mediation’ to differing degrees, most of them concentrating on ‘mentoring’ and ‘personnel and instructions’. VET outputs were also covered very differently, ‘overall assessment and satisfaction’ being the category chosen most often. The categories covering by far the largest selection of items were ‘vocational training framework’, ‘professional competence’, ‘overall assessment and satisfaction’, and ‘future prospects and career aspirations’ (≥ 180 items each). There was much less item variety in the content areas ‘work tasks’ and ‘social interaction’, and in some output categories.

Aside from this nominal and substantive heterogeneity, it is especially problematic that only a minority of studies reported validation results for their test instruments (Baethge-Kinsky et al. 2016; Dietzen et al. 2014; Fieger 2012; Hofmann et al. 2014; Lee and Polidano 2010; NCVER 2000; NCVER 2008; NCVER 2012; Nickolaus et al. 2009; Nickolaus et al. 2015; Prenzel and Drechsel 1996; Prenzel et al. 1996; Rausch 2012; Ulrich and Tuschke 1995; Velten and Schnitzler 2012; Velten et al. 2015; Virtanen and Tynjälä 2008; Virtanen et al. 2009, 2014; Wosnitza and Eugster 2001; Zimmermann et al. 1994, 1999). With only a few exceptions, the scales of these test instruments did show satisfactory to excellent levels of internal consistency (e.g. Fieger 2012; Rausch 2012), and the factor analyses that were carried out in some cases did confirm the model fit (e.g. Zimmermann et al. 1994). As the quality aspects of some test instruments are described in several studies, only ten of the 43 test instruments, each of which focuses on selected aspects of the general model, can be considered psychometrically validated.

The following conclusions can thus be drawn: A large number of test instruments for the VET context already exist – theoretically, researchers could make use of more than 3000 items. However, recourse to any single existing survey is of limited value, as the majority of studies either focus on a small range of selected categories and/or do not report validation results for their scales. Further, the lack of a comprehensive and reliable test instrument for the analysis of in-company training conditions in VET must be noted. A test instrument that does reflect all characteristics identified in the category system is presented hereafter. A distinctive feature of this questionnaire is that, wherever possible, it refers back to existing items and scales. It therefore reflects the foci of a longstanding and diverse research tradition of analyzing in-company training conditions in dual VET systems.

Study 2: Design and Validation of the VET-Learning Quality Inventory (VET-LQI)

In the following section the development and design of the VET-LQI are set out. A first version of the test instrument, with 166 items, was pretested in 2017 in three vocational schools in the German state of Baden-Wuerttemberg. The sample consisted of 393 apprentices from 15 different commercial VET occupations. This pretest led to some helpful adjustments, especially in regard to item reduction and wording. In particular, items were removed that caused comprehension difficulties or that did not improve the reliability of their respective scale and were otherwise dispensable in terms of adequately representing content validity. This resulted in a shorter version of the VET-LQI, with 139 items, which is presented hereafter. The original version of the test instrument was in German, and was tested in three other vocational schools in Baden-Wuerttemberg in spring 2018. The structure of the questionnaire, the data, the item and factor analyses, and the results are presented subsequently.

Structure of VET-LQI

The VET-LQI is a synthesized test instrument that, on the basis of previous research efforts, reflects learning-relevant quality aspects, with a focus on the key characteristics of in-company training conditions. It therefore covers aspects of those categories that were identified in the qualitative meta-synthesis described above. The construction of the new scales combines efficient and frequently used existing items with newly developed items. In cases where existing scales and items lacked validation, used unsuitable answer scales or contained occupation-specific formulations, either adaptation to a standard format or the creation of new items was necessary. In sum, 82 items could be accepted in full or in essence (Table 4). All other items were (re-)formulated and designed on the basis of the category’s underlying codebook and – as the aim was to design a test instrument usable in different occupational contexts – with generic phrases referring to the respective occupation. Whenever necessary, the formulation of items included additional wording such as ‘in my company’ or ‘in my department’ to clarify that apprentices were asked to focus on the in-company component of VET, rather than on school-based aspects. Some items also explicitly indicated whether a question related to the skilled occupation or to the training company. This differentiation was particularly relevant in those scales representing the categories ‘premature termination of contract’, ‘career choice’, ‘vocational identity’, ‘operational identity’ and ‘future prospects and career aspirations’. With the exception of five scales (‘demographical factors’, ‘biography’, ‘academic performance’, ‘application process’ and ‘company framework’), items were answered on a 7-point Likert scale (1 ‘totally disagree’ – 7 ‘completely agree’). Additionally, respondents could decline to reply to any single item by choosing ‘I do not want to or cannot answer this’. The test instrument was completely anonymous: no information was gathered that would enable researchers to draw conclusions as to the identity of individual persons, school classes, vocational schools or training companies. The VET-LQI was presented in paper-pencil mode and contained 139 items in 31 scales. These were all derived from the category system – except the scale ‘academic performance’, which is content-related but was intentionally separated from other questions dealing with the ‘personal background’ of the apprentice (Fig. 2).

First, a German version of the test instrument was designed. Then, every single item was translated into English. This translation process was carried out by two researchers and rechecked by a native English speaker. Hence, the VET-LQI is available in two languages (see Appendix). The results of the test of the German version are presented below.

Data Collection and Sample

The estimated time for answering the questionnaire was 45 min; the majority of apprentices finished within 30 min. The sample consisted of 428 apprentices (N = 427 after data editing), aged between 16 and 37 (mean: 20.5). 233 female and 194 male apprentices were surveyed; around 50% of them were in their first year of training, another 30% in their second year and approximately 20% in their third year. Data were collected in seven commercial VET occupations; the distribution is given in Table 2.

Table 2 Sample distribution

Table 3 presents weighted values for the sample in comparison to the statistical population of the seven training occupations analyzed. Between the two groups, t-tests and chi2-tests showed no significant differences in regard to age (p = .426), gender (p = .586) and education (p = .100).

Table 3 Sample representativeness
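To illustrate, checks of this kind can be run in base R along the following lines; the toy data and population values below are placeholders, not the actual figures behind Table 3.

```r
# Minimal sketch of the representativeness checks; all data and population
# values are invented placeholders, not the figures behind Table 3.
set.seed(1)
apprentices <- data.frame(
  age    = round(rnorm(427, mean = 20.5, sd = 2.5)),
  gender = sample(c("female", "male"), 427, replace = TRUE, prob = c(.55, .45))
)

t.test(apprentices$age, mu = 20.4)             # sample vs. assumed population mean
chisq.test(table(apprentices$gender),          # sample vs. assumed population
           p = c(female = 0.53, male = 0.47))  # gender proportions
```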

First Item Analysis Results

Turning to the item analysis results, it has to be noted that the scales covering framework conditions (‘vocational training framework’ and ‘company framework’), as well as two output scales (‘completion and final exam’ and ‘career choice’), were not considered here. Their content may be reformulated or complemented on a very individual basis – depending on the use and context of the intended study, especially regarding questions covering personal details of the respondent and the vocational training framework conditions. Moreover, those scales that primarily represented formative rather than reflective measurement scales were also excluded from further analysis (Bollen and Lennox 1991; Diamantopoulos and Winklhofer 2001). This applied especially to the questions related to the framework conditions and the output scales indicated above, which were unlikely to meet the requirements of factor analysis, as they do not describe reflective theoretical concepts. For the remaining 99 items in 22 scales, a first aim of the item analysis was to shorten the questionnaire by identifying unsatisfactory items.

In the remaining 22 scales, 15 items were eliminated after a first examination: six because of a combination of low discrimination power (< .3, Ebel and Frisbie 1986) and the potential to improve the internal reliability of their scale (items 028, 029, 066, 069, 087, and 139), and nine either because of a low scale correlation or because they appeared problematic for respondents in respect of their content or wording, as reflected in a ratio of missing values above 5% (items 032, 044, 047, 055, 062, 082, 096, 128, and 135). The internal consistency of the majority of scales was appropriate; 19 scales reached good or excellent values (Cronbach’s alpha > .7), while two scales were at least acceptable (Cronbach’s alpha > .6, DeVellis 1991; Nunnally 1978; Robinson et al. 1991). The ‘relevance of tasks’ scale performed particularly poorly (Cronbach’s alpha: .447). This was caused by low discrimination power for two of the three items in this scale (item 059, with a discrimination power of .289, and item 061r, with a discrimination power of .217). The ‘training requirements and ability level’ scale was just below a good level of internal consistency (Cronbach’s alpha: .684); it contained an item with a low level of discrimination power (item 070, with a discrimination power of .291). The ‘overall assessment and satisfaction’ scale, with a Cronbach’s alpha of .657, was also just below a good level. Its internal consistency could be improved by excluding item 124, which had a low discrimination power of only .234. However, for content-related reasons item 124 was initially maintained.
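By way of illustration, statistics of this kind can be obtained with the psych package in R; the data frame, item names and resulting values below are hypothetical, not the reported figures.

```r
# Minimal sketch of the item analysis for a single scale, using the psych
# package; 'responses' and its item names are hypothetical placeholders.
library(psych)

set.seed(2)
responses <- data.frame(item059  = sample(1:7, 427, replace = TRUE),
                        item060  = sample(1:7, 427, replace = TRUE),
                        item061r = sample(1:7, 427, replace = TRUE))

res <- alpha(responses)      # scale-level reliability analysis
res$total$raw_alpha          # Cronbach's alpha of the scale
res$item.stats$r.drop        # corrected item-total correlation (discrimination power)
res$alpha.drop               # alpha if a given item were dropped
```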

Summarizing the findings: by excluding four additional items (items 033, 052, 088 and 124), the internal consistency of the scales could have been further improved. Moreover, five other items generated more than 5% – set as a critical threshold – of refused or missing answers (items 026 and 064 with 6%, item 090 with 9%, and items 091 and 129 with 8% missing values). Nevertheless, content-related reasons argued for retaining them initially, while keeping these twelve critical items in mind. In sum, 84 out of 99 items were kept for further analyses. The results of the item analyses are presented in Table 4.

Table 4 Item analysis results

Confirmatory Factor Analysis and Adaption of VET-LQI

Based on the category system in Fig. 2, a confirmatory factor analysis (CFA) was performed in R using the lavaan package (Rosseel 2012). CFA assumes reflective models, meaning that changes in the hypothetical construct cause changes in the indicator variables. Hence, all analyses were based on factor loadings and correlations between indicators and factors (e.g. Anderson and Gerbing 1988). When reflective models are adopted, all indicator variables describing one hypothetical construct are assumed to be highly correlated.

For the CFA, first the free parameters within the base model had to be identified (Rosseel 2012). In this case, there were 22 latent variables (factors), each with an individual number of observable variables (indicators): in total, 84 items. Hence, 84 factor loadings needed to be estimated, as well as 22 covariances between the factors. Additionally, the residual variances of the indicators and the variances of the factors had to be estimated – another 106 free parameters. In total, there were 212 free parameters. However, as the factor loading of one indicator variable was set to one for each latent variable,Footnote 2 another 22 parameters were fixed. Hence, 190 free parameters had to be estimated. Second, we used MLM – a robust version of maximum likelihood (ML) – as the estimator for the CFA (e.g. Curran et al. 1996; Gold et al. 2003). MLM is based on the Satorra-Bentler scaled Chi2 statistic, which yielded good results in robustness studies, especially when indicator variables deviated strongly from normality (Boomsma and Hoogland 2001; Chou and Bentler 1995; Schermelleh-Engel et al. 2003). Third, global goodness-of-fit indices for this model are reported on the basis of the recommendations by Schermelleh-Engel et al. (2003) and Matsunaga (2010), including the Chi2/df ratio, p value (Chi2), RMSEA, SRMR, CFI and TLI (Bentler 1990; Bentler 1995; Bollen 1989; Browne and Cudeck 1993; Hu and Bentler 1995; Kaplan 2000; Matsunaga 2010; Schermelleh-Engel et al. 2003; Steiger 1990; Vandenberg 2006). However, it has to be noted that, because of the high model complexity due to the 22 latent variables, only RMSEA and SRMR can be considered interpretable, as these measures are insensitive to model complexity (Bentler 1995; Browne and Cudeck 1993; Kaplan 2000; Schermelleh-Engel et al. 2003; Steiger 1990). All other indicators deteriorate with sample size and the number of latent variables (Bentler 1990; Bollen 1989; Hu and Bentler 1995; Matsunaga 2010; Schermelleh-Engel et al. 2003; Vandenberg 2006). For the interpretation of RMSEA, the thresholds reported by Browne and Cudeck (1993) were applied: RMSEA > .10 not acceptable, .08–.10 mediocre fit, .05–.08 acceptable fit and < .05 good fit. For the interpretation of SRMR, the thresholds reported by Hu and Bentler (1995) were applied: SRMR < .10 acceptable fit, < .05 good fit.
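A sketch of this set-up in lavaan might look as follows; the factor names, item names and simulated data are illustrative stand-ins for the actual 22 factors and 84 indicators, not the authors’ model syntax.

```r
# Minimal sketch of the CFA set-up in lavaan; two toy factors stand in
# for the actual 22 factors and 84 indicators.
library(lavaan)

set.seed(3)
n <- 427
lat1 <- rnorm(n); lat2 <- rnorm(n)  # toy latent scores for simulated data
responses <- data.frame(
  item080 = lat1 + rnorm(n), item081 = lat1 + rnorm(n), item082 = lat1 + rnorm(n),
  item090 = lat2 + rnorm(n), item091 = lat2 + rnorm(n), item092 = lat2 + rnorm(n)
)

model <- '
  # lavaan fixes the first loading of each factor to one by default and
  # estimates the covariances between the latent variables
  mentoring =~ item080 + item081 + item082
  feedback  =~ item090 + item091 + item092
  # ... the remaining factors would be defined analogously
'

fit <- cfa(model, data = responses, estimator = "MLM")  # Satorra-Bentler scaling
fitMeasures(fit, c("chisq.scaled", "df", "pvalue.scaled",
                   "rmsea.scaled", "srmr", "cfi.scaled", "tli.scaled"))
```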

Generally speaking, and following Matsunaga (2010, p. 108, citing Kenny and McCoach 2003): ‘(I)t seems noteworthy that the number of items being analyzed in a given CFA is negatively associated with the model’s goodness of fit. In other words, (...) the more the items, the worse the model fit’. Hence, not surprisingly, an examination of the local goodness-of-fit indices for the 22-factor solution indicated that improvements should be possible (Table 5). First – on the item-factor level – single factor loadings should be relatively high and unambiguous, meaning that indicator variables with generally low factor loadings, or with high factor loadings on more than one latent variable, might be problematic. Unfortunately, there is no exact threshold for defining ‘low factor loadings’ or ‘high cross-loadings’. In this article, the threshold for single factor loadings is set to a minimum of | ± .4 |.Footnote 3 Cross-loadings appear problematic if differences in factor loadings are below | ± .2 |Footnote 4 (Matsunaga 2010). Second, the significance of factor loadings should indicate that every indicator variable is affected by the latent variable within the population; within this dataset, p values were significant for all items. Third, indicator variables with low communalities, or with high values for uniqueness (indicating the proportion of variance that cannot be explained by the factors), should be treated with caution. For our purpose, the threshold was set to a uniqueness value above .6, which applied to two variables.Footnote 5 Fourth, high correlations (> .8) between variables were checked. These were not found between variables of different factors; otherwise they could have pointed to multicollinearity.

Table 5 Global fit indices for CFA: Base model
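The local checks described above could be implemented along the following lines, reusing the hypothetical lavaan object 'fit' from the previous sketch; the thresholds follow the text.

```r
# Minimal sketch of the local fit checks on the lavaan object 'fit' from
# the previous sketch; thresholds follow the text (|loading| >= .4,
# uniqueness <= .6, inter-factor correlations <= .8).
std <- standardizedSolution(fit)

# indicator variables with low standardized factor loadings
low_loadings <- subset(std, op == "=~" & abs(est.std) < 0.4)

# uniqueness: standardized residual variances of the observed indicators
latents <- unique(std$lhs[std$op == "=~"])
uniqueness <- subset(std, op == "~~" & lhs == rhs & !(lhs %in% latents))
high_uniqueness <- subset(uniqueness, est.std > 0.6)

# high correlations between latent factors (possible multicollinearity)
fac_cor <- lavInspect(fit, "cor.lv")
which(abs(fac_cor) > 0.8 & upper.tri(fac_cor), arr.ind = TRUE)
```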

As a next step, several items were eliminated from the base model according to the following criteria: high uniqueness values (> .6), low factor loadings (< .4) and low discrimination power (< .3). Three items appeared particularly problematic.Footnote 6 As two of them belonged to a common scale, for content-related reasons only the worse of these two was initially eliminated. Hence, items 059 and 124 were excluded (Table 6, model 3). Then, two more items were eliminated, due either to low discrimination power values or to high loadings on more than one factorFootnote 7 (model 4). Finally, another four items were excluded, for different reasons but with non-critical effects on the underlying scale in regard to the number of items or reliabilityFootnote 8 (model 5).

Table 6 Global fit indices for CFA: Different models in comparison

Summarizing the findings, the differences in fit indices were evaluated in order to compare the models. Even when applying a strict threshold of < 2 (Byrne 1991), the Chi2/df ratio indicated an acceptable fit for the base model as well as for all variations after item elimination – nevertheless, and not surprisingly, the Chi2 p value remained significant in all cases. CFI and TLI did not reach an acceptable fit (Hu and Bentler 1999; Russell 2002); this was however to be expected as, in contrast to RMSEA and SRMR, they punish model complexity, which was extensive for our estimated model. Model comparison indicated an improvement for the non-hierarchical models after the elimination of eight items (see model 5 compared to models 1 to 4). RMSEA already indicated a good fit for the base model (model 1), and improved slightly for models 3, 4 and 5. SRMR followed this logic, reaching its best value for model 5. Based on these fit indices, and as the most parsimonious solution, model 5 was chosen as representing the best overall fit (Fig. 3). It should be noted that the fit indices also indicate an acceptable model fit under the assumption of a 22-factor solution with one higher-level factor (model 6).

Fig. 3
figure 3

Item selection factor loadings based on model 5, referring to the 3-p model (Böhn and Deutscher 2019; Tynjälä 2013)
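A descriptive comparison of this kind could be reproduced along the following lines, continuing from the earlier CFA sketch; dropping 'item082' merely stands in for the actual eliminations behind models 3 to 5.

```r
# Minimal sketch of a model comparison after item elimination, continuing
# from the CFA sketch above; dropping 'item082' stands in for the actual
# eliminations reported in Table 6.
model_reduced <- '
  mentoring =~ item080 + item081
  feedback  =~ item090 + item091 + item092
'
fit_reduced <- cfa(model_reduced, data = responses, estimator = "MLM")

round(sapply(list(base = fit, reduced = fit_reduced), fitMeasures,
             fit.measures = c("rmsea.scaled", "srmr")), 3)
```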

Concerning the reliability of single items and scales, the following considerations should be noted: First, with the item selection in model 5, Cronbach’s alpha decreased slightly for seven scales.Footnote 9 As these changes in internal consistency were rather small, the aim of reducing the total number of items took precedence. Second, regarding the ‘overall assessment and satisfaction’ scale, item elimination improved Cronbach’s alpha.Footnote 10 In the case of the ‘in-company learning’ scale, item 027 was split into two separate questions due to its inconsistent wording: ‘Workplace learning in my company is characterized by the usage of different materials’ and ‘Workplace learning in my company is characterized by the usage of different media’. Beyond the structural validity of the VET-LQI, its convergent and discriminant validity were also checked. First, the average variance extracted (AVE) criterion, indicating convergent validity, yielded good results (> .5) for 16 of the 22 factors, while five factors were only slightly below this threshold (> .4) (Fornell and Larcker 1981). Factor 7 – which had already performed poorly in the reliability analysis – yielded an unsatisfactory value of .373. Discriminant validity, as the degree to which measures of different traits are unrelated, was assessed by the Fornell-Larcker criterion, as well as by analyzing the corrected correlations between the factors of the CFA. Thirteen factors met the Fornell-Larcker criterion (Fornell and Larcker 1981). The correlation matrix between the latent variables (Table 7) further shows that nineteen factors had moderate correlation values, but factors 19, 20 and 22 had high intercorrelations of > .8 (e.g. Evans 1996) that must be considered problematic.

Table 7 Correlations between latent factors
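For illustration, both criteria can be computed from the standardized lavaan solution as follows, again reusing the hypothetical object 'fit' from the CFA sketch.

```r
# Minimal sketch of the convergent and discriminant validity checks on the
# lavaan object 'fit'. AVE is the mean of the squared standardized loadings
# per factor; the Fornell-Larcker criterion requires each factor's AVE to
# exceed its squared correlations with all other factors.
std <- standardizedSolution(fit)
loads <- subset(std, op == "=~")
ave <- tapply(loads$est.std^2, loads$lhs, mean)  # AVE per factor (target > .5)

fac_cor <- lavInspect(fit, "cor.lv")
sq_cor <- fac_cor^2
diag(sq_cor) <- 0                                # ignore self-correlations
ave[colnames(sq_cor)] > apply(sq_cor, 2, max)    # TRUE = criterion met
```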

Conclusion

Despite the possible disadvantages of retrospective surveys in general (Rausch 2013; Tourangeau 2000), the vast majority of researchers still use questionnaire designs to gain insights into quality aspects of VET. Moreover, previous research activities in VET have only partly focused on in-company training factors, primarily concentrating on learner factors. Particularly because of this focus, and additionally because different test instruments are used to operationalize training conditions, aggregating findings regarding VET quality remains difficult. Therefore, the aim of this article was to present a validated selection of items and scales, synthesizing the existing research on in-company training conditions, that can be used in the context of dual VET. More than 3000 items were identified and categorized, and a selection was transferred into an integrative test instrument. This questionnaire may contribute to future research activities by providing a time-saving selection of validated items and scales for the analysis of in-company training conditions, and by linking them to a general theoretical framework for vocational learning (Tynjälä 2013).

It was also our aim to present short scales that allow for a broad analysis of in-company training conditions within limited testing time. These scales can, however, also be reused independently of one another in future research with a more specific focus. Especially in regard to the length of the questionnaire, a 5-point Likert scale could also be sufficient. Furthermore, it might be helpful to rename the middle category, to enable participants to express a neutral position (‘uncertain’ / ‘unentschlossen’ instead of ‘partly agree’ / ‘trifft teilweise zu’). With the help of item and factor analyses, useful indications for a parsimonious design of the test instrument could be gained. The total item number was reduced from 139 to 116 items that collectively cover all the workplace characteristics scales found in existing measurement instruments for dual VET. Analyses at both item and factor levels showed satisfactory results for the VET-LQI. All scales except ‘relevance of tasks’ reached acceptable or good internal consistency values (.677 < Cronbach’s alpha < .893). While the reliability analysis, as well as the convergent and discriminant validity analyses, overall yielded satisfactory results, the ‘relevance of tasks’ factor also lacked convergent validity and should therefore be adapted for future use. In addition, with respect to discriminant validity, factor 19 (‘overall assessment and satisfaction’), factor 20 (‘vocational identity’) and factor 22 (‘future prospects and career aspirations’) had high intercorrelations of > .8. Their combined use cannot be recommended, as this would likely cause empirical problems in future causal studies (e.g. multicollinearity). However, their high correlation seems theoretically plausible. This also explains why some factors did not meet the strict Fornell-Larcker criterion (factors 1, 7, 10, 12, 14, 15, 19, 20, and 22). Although these showed only negligible deviations of < .1, four of them additionally contained at least one item with high cross-loadings (factors 1, 7, 12, and 19). The combination of both indicators might indeed be problematic with regard to discriminant validity. Moreover, as our analyses were based on a limited sample size, had a regional (German) focus, and were confined to seven commercial occupations in the dual VET context, the instrument remains to be validated for different samples and occupational fields in future surveys – and, of course, with the use of the English translations of the scales. If necessary, occupation-specific adaptations and additions of items would be helpful in respect of item formulation, so that apprentices can more easily relate to the instrument. Our results are further limited by the fact that testing other hierarchical models as part of the factor analysis was not possible, as the sample size was not sufficient for this kind of analysis. Future studies might show whether higher-order solutions yield superior fit statistics.

Conceivably, the instrument could be adapted for longitudinal use, by dividing it according to the input, process and output dimensions and collecting data at the very beginning of VET, during VET (preferably multiple times) and at the very end of VET, or at least very shortly thereafter. The VET-LQI, by its very design, reflects a process-oriented structure that is particularly fruitful for such longitudinal research designs, where the aim is to identify relations between several inputs, processes and vocational training outputs. Such a longitudinal research design would then also allow for an assessment of measurement invariance. Moreover, some scales used within the VET-LQI could give an indication of how to design surveys focusing on the perspective of other actors in the VET context, e.g. training personnel or vocational teachers. Especially when different perspectives are to be compared – as has been the aim of at least some studies in the past (e.g. Krewerth et al. 2008; Pineda-Herrero et al. 2015; Saboga 2008; Walker et al. 2012) – this might be a worthwhile endeavor. Finally, it might also be possible to use the presented items outside of the traditional VET context, e.g. to survey dual students who are studying at universities of applied sciences and working part-time. Undoubtedly, certain items would then need to be adapted and tested again for validity purposes.