1 Introduction

In the year 2011 the first prototype of Oculus Rift was designed, marking the beginning of a new era of virtual reality (VR). Over the next several years, Oculus Rift was joined by similar VR systems, the most notable being HTC Vive, PlayStation VR, and Valve Index. Unlike most devices that came before, mid-2010s commercial VR systems showcased high-quality features whilst remaining within a relatively affordable price range. The public was intrigued by the promises of unprecedented levels of immersion and novel ways of interacting with the virtual world. In fact, the results of recent studies confirm what is considered to be VR’s selling point—not only is VR more appealing [1] compared to non-immersive platforms, but playing a game in VR appears to improve perceived presence [2], as well as the overall satisfaction [3], enjoyment [3], and happiness [2]. And yet, as of March 2021—a decade after the initial Oculus Rift design—the percentage of VR headset owners among Steam users sits at a fairly low 2.60%, with only a slight increase of 0.09% compared to the previous month [4].

To fully realize the disruptive potential of immersive technologies such as VR, as well as augmented reality (AR), hardware manufacturers and content developers need to address the issue from the perspective of the end user. Based on the definition given in the Qualinet White Paper on Definitions of Quality of Experience [5], QoE for immersive media, such as VR, is defined as: “the degree of delight or annoyance of the user of an application or service which involves an immersive media experience. It results from the fulfillment of his or her expectations with respect to the utility and/or enjoyment of the application or service in the light of the user’s personality and current state.” [6]. This paper brings together key findings from literature and standards related to QoE assessment for VR, and provides readers with a systematic overview and guidelines on key aspects to be considered when planning and executing QoE assessment studies. While many individual aspects addressed in this paper have already been discussed at length in various research publications (e.g., [7,8,9]), our paper is conceived as a comprehensive collection of valuable resources, recommendations and explanations of relevant concepts, tools and methods, to be used as a reference for scientists and industry experts looking to incorporate user testing into their research and development process. As such, it provides information that may be useful to researchers from various disciplines and presents methods suitable for testing interactive VR services regardless of their intended use.

Parés and Parés [10] discuss the distinction between the terms virtual environment (VE) and virtual reality (VR). According to their definition, VE refers to a static environment comprised of different content, geometry and static rules of the environment, while VR refers to a VE in action (i.e., experienced in real time). However, this paper will refer to the term virtual reality as defined by Aukstakalnis [11], who refers to VR as different display technologies (e.g., head-mounted display—HMD, computer-assisted virtual environments—CAVE) capable of generating sensations of immersion and presence inside a three-dimensional model or simulation, therefore creating a visual replacement of the real world. As such, VR belongs to what we refer to as Immersive Media Technologies (IMT). State of the art information on IMT and Immersive Media Experience (IMEx) has recently been presented in the QUALINET White Paper on Definitions of Immersive Media Experience (IMEx) [6]. A systematic literature review on the topic of immersive systems is provided by Liberatore and Wagner [12].

While VR systems can be used for viewing 360-degree videos (e.g., [13]), we highlight that this paper focuses on QoE for interactive applications. Considering that interactivity is defined as “the extent to which users can participate in modifying the form and content of a mediated environment in real time” [14], we consider interactive VR applications as being those which enable users to navigate and/or manipulate the virtual environment, instead of passively observing. Interactive VR applications share many similarities with conventional interactive applications, such as computer games, which means that papers and recommendations related to gaming QoE may serve as useful guidelines for designing VR applications and conducting VR studies. Beyond its stronger focus on immersion and presence, what distinguishes VR from non-immersive platforms are issues of physical side effects (i.e., cybersickness) and general discomfort, which plague as many as 80% of users [15] and therefore present a serious obstacle not only to achieving high levels of user satisfaction but, more importantly, to providing a healthy and safe experience. Unfortunately, optimizing these factors proves to be especially problematic, as presence and cybersickness appear to be negatively correlated [16]: by making application design choices aimed at improving perceived presence, developers are more likely to provoke cybersickness symptoms in users. This paradox highlights the importance of close examination of various factors that contribute to increasing perceived presence and decreasing cybersickness, as well as the importance of a thorough analysis of ways in which these factors relate to each other and contribute to the overall QoE score.
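As a rough illustration of how a relationship between presence and cybersickness could be examined on study data, the following minimal sketch computes a Pearson correlation coefficient over per-participant scores. The function name `pearson_r`, the score scales, and all values are hypothetical and are not taken from [16]; a real analysis would also need to check the assumptions of the chosen correlation measure.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical per-participant scores (illustrative values only)
presence = [5.5, 4.0, 6.0, 3.5, 5.0]        # e.g., presence questionnaire means
sickness = [10.0, 35.0, 5.0, 40.0, 20.0]    # e.g., cybersickness questionnaire totals

r = pearson_r(presence, sickness)
print(f"r = {r:.3f}")  # a negative value indicates an inverse relationship
```

A coefficient computed this way describes association only; it does not by itself show which design choices cause either outcome.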

This paper focuses on perception-based methods of QoE assessment, i.e., methods based on testing human evaluators (participants). As explained in [17], participants in user studies may be presented with a test stimulus or multiple test stimuli, asked to interact with a system, and/or use the system in interaction with another person. Based on these experiences, users provide quantitative or qualitative subjective evaluations, which subsequently undergo statistical analysis. In addition to subjective measurements, researchers often employ objective methods of evaluation, such as different physiological, behavioral, and task performance measurements. Considering that QoE as a field of research focuses on the isolation of specific factors, perception-based QoE assessment may serve as a foundation for analysis of individual QoE elements, in addition to being a step towards ascertaining the value of QoE as a holistic concept. The overall QoE of a VR application, service, or system, can therefore be considered a combination of its elements, although their individual relationships can only be determined based on experimental data.
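To make the step from subjective ratings to statistical analysis concrete, the sketch below summarizes ratings for a single test condition as a mean score with an approximate 95% confidence interval. It is only an illustration under stated assumptions: the 5-point scale, the function name `mean_opinion_score`, and the ratings are hypothetical, and the normal-approximation factor `z = 1.96` is a simplification (for small samples, a t-distribution quantile would be more appropriate).

```python
import math
import statistics

def mean_opinion_score(ratings, z=1.96):
    """Return (mean, half-width of an approximate 95% CI) for a list of ratings."""
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    sd = statistics.stdev(ratings)        # sample standard deviation (n - 1)
    half_width = z * sd / math.sqrt(n)    # normal approximation (assumption)
    return mean, half_width

# Hypothetical 5-point ratings from eight participants for one test condition
ratings = [4, 5, 3, 4, 4, 2, 5, 4]
mos, ci = mean_opinion_score(ratings)
print(f"mean rating = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean makes it easier to judge whether differences between test conditions exceed the variability among participants.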

In this paper, we provide an analysis of different factors impacting QoE for interactive, synthetic, locally-rendered VR applications, with a special focus on the aforementioned key elements of VR applications (i.e., presence, immersion, and cybersickness), and offer a comprehensive overview of published work. While the various addressed aspects of QoE assessment are applicable to both single and multi-user scenarios (referring to VR meetings and collaborative applications), we note that VR telemeetings imply an additional set of influence factors and QoE dimensions. While we do not address these aspects in detail, the interested reader is referred to the baseline draft of the ITU-T Recommendation for QoE assessment of extended reality (XR) meetings [18], developed in the scope of Study Group 12.

As we discuss the multidisciplinary field of QoE in the context of VR as a multimodal platform developed for a variety of use-cases, ranging from medicine and engineering to education and entertainment, our guiding principle in writing this paper is to adopt a thoroughly integrative approach. Considering the overarching similarities between the two (as discussed in [19]), we include notable studies and research guidelines stemming from the field of User Experience (UX) in addition to sources directly pertaining to the field of QoE. To further explain the reasoning behind discussed or proposed methodology choices (especially in relation to the issues of ethics and safety), we refer to research encompassing several disciplines, such as psychology, medicine, telecommunications, and computer science.

The structure of the paper (Sects. 2–8) loosely follows the set of questions referred to as the seven circumstances [20], as illustrated in Fig. 1, which presents the research questions we aim to address in each section. Using this format, we aim to systematically discuss different aspects of perception-based QoE assessment, ranging from different influence factors and QoE features, the knowledge of which can aid in defining the study objective, to ways in which researchers can provide a safe, ethical research environment, eliminate bias in participant choice and study design, and define appropriate study methodologies with respect to external and internal validity. Key challenges and ideas for future work are presented in Sect. 9, while our concluding remarks are presented in Sect. 10.

Fig. 1 Paper structure and related research questions

2 The importance of QoE assessment for immersive VR applications

According to a survey conducted by Perkins Coie for the year 2020 [21], limited quality and/or quantity of available VR content is considered to be the biggest obstacle to mass adoption of VR, followed by inadequate user experience, and consumer and business reluctance to use AR/VR technology. These obstacles appear to be interconnected—e.g., if the goal is to encourage customers to increase the “demand” for content, developers and manufacturers would first have to improve the quality of the “supply”, which entails improvements regarding user experience. However, adapting the VR technology and content in a way that shows potential to significantly improve user experience is a complex challenge which requires a deeper understanding of a multitude of factors, especially considering that—given the number of different use-cases and stakeholders—there is no one-size-fits-all approach to VR hardware and software design.

2.1 Relevant stakeholders

With commercial VR technology still in the early stages of market penetration, cumbersome consumer grade VR solutions leave a lot of room for improvement. Different VR hardware manufacturers are attempting to compete by consistently adding new device features, such as eye-tracking technology (e.g., HTC Vive Pro Eye) or standalone headsets (e.g., Oculus Quest). A diverse selection of different I/O devices (e.g., body tracking technology, haptic devices) has begun to emerge on the market, as companies strive to develop a more natural way to interact with the VE. In addition to efforts invested towards improving single-user solutions, companies focused on gaming and entertainment applications may choose to focus on utilizing VR’s ability to create high levels of co-presence, as evidenced by Horizon Worlds—a social VR world which is in the user testing phase at the time of this writing. As with other multi-user experiences realized through less immersive platforms, social use of VR technology requires adequate network conditions in terms of available bandwidth and low latency. Additionally, a more relevant VR-related challenge for network service providers pertains to a shift towards split rendering, which utilizes edge cloud infrastructure, as well as the increased use of IoT sensors and actuators contributing toward a more immersive experience, enabled by the capabilities of 5G networks [22].

In the context of VR, the most obvious customer base can be found in gaming enthusiasts seeking a novel, more intense experience. However, VR has long been, and will continue to be, used for various purposes other than its most commonly mentioned use-case—as stated in Perkins Coie [21], aside from gaming and entertainment, immersive technologies (AR and VR) were expected to make a significant impact on the following sectors in the year 2020: healthcare and medical devices sector, education, workforce development and training, manufacturing and automotive industry, marketing and advertising, logistics/transportation, retail/e-commerce, military and defense, commercial and residential real estate, and tourism. The scope of possible VR uses requires thorough research, as each field comes with its own set of requirements in terms of content and input/output devices. However, some aspects and principles of VR design can be generalized across various use-cases and populations. Therefore, there is a need for highly specialized studies using specialized equipment and a target demographic, as well as for more generalized VR QoE/UX studies with a diverse range of participants.

2.2 Understanding user acceptance of virtual reality

To systematise the factors that influence the user to consider using or purchasing VR technology, researchers have developed appropriate technology acceptance models. Sagnier et al. [23] present a VR-adapted extension to the Technology Acceptance Model (TAM) [24]. The model describes the impact of different dimensions of user experience on perceived ease of use and perceived usefulness. Perceived ease of use was found to be significantly influenced by pragmatic quality, i.e., the usability and the utility of the product [25]. Perceived usefulness was found to be significantly influenced by stimulation (a hedonic quality that refers to “the individual’s pursuit of novelty and challenge” [25]) and personal innovativeness. Participants’ intention to use VR appears to be significantly increased by perceived usefulness, and significantly decreased by the severity of cybersickness symptoms, while a significant direct effect of presence has not been found, although it may pose an indirect influence by affecting other variables. A similar TAM-based model, focusing on VR hardware acceptance, is presented by Manis and Choi [26]. The model distinguishes between intention toward using VR hardware, and intention toward purchasing VR hardware. Unlike the model by Sagnier et al. [23], this model does not examine the influence of presence and cybersickness, but it does account for user-related factors such as age, previous experience, and the price they were willing to pay for the product. The authors discuss curiosity, perceived usefulness, and perceived ease of use.

According to Whalen et al. [27], QoE in a virtual environment can be maximized by increasing the feeling of enjoyment in users, making it easier for them to accomplish their goals in the context of the application, service, or system, and decreasing discomfort and/or stress. These aspects coincide with the factors presented in Sagnier et al. [23] and Manis and Choi [26], highlighting the connection between VR QoE, intention to use VR software and/or hardware, and the intention to purchase VR software and/or hardware. Therefore, by collecting user evaluations during/after usage of various VR services, realized using different VR systems, researchers gain a deeper insight into multiple variables (often referred to as influence factors) affecting user experience, and acquire knowledge regarding their mutual relationship.

3 Quality of experience: influence factors and key features

Fig. 2 VR QoE influence factor categories (adapted from [28])

The Qualinet White Paper [5] defines influence factors (IFs) as traits exhibited by the system, service, application, or even users themselves, that may potentially influence QoE of the users of an application or service. Our concise overview of influence factors affecting the interactive VR experience is based on—but not limited to—the classification of influence factors for VR as presented in ITU-T Recomm. G.1035 [28] (Fig. 2).

3.1 Human influence factors

In terms of human (also referred to as user) influence factors, researchers often choose to examine dynamic human factors, such as the current affective state of the user, as well as static human factors, which refer to the fixed traits of the participant (e.g., age, sex, etc.). With the common occurrence of VR-related discomfort being an impetus for further research, a high importance is placed on human IFs such as history of illness (e.g., migraine, motion sickness), as well as relevant factors related to vision and hearing. Additionally, previous history of technology use may greatly influence task performance, level of discomfort, and overall satisfaction with the used system. To facilitate comparison of these aspects based on user expertise, participants can be classified according to their general experience with interactive applications (e.g., games) or immersive technology, experience with a particular type/genre of application or, even more specifically, previous experience using a particular application. Considering that VR is still not widely adopted, it should be expected that test subjects may require more time to acclimate to new devices and make more requests for help and instructions [29]. Additionally, when using novel technology, users may perceive their experience as higher in quality due to their own increased interest levels [30]. Further explanations of the ways in which previous experience, or expectations set by previous theoretical knowledge of the system/service, may influence QoE can be found in Sect. 4.1. While listed as influence factors in the ITU-T G.1035 recommendation, cybersickness and immersion may also be examined as QoE features, dependent on other human, system, and context factors. As such, we will describe them in more detail in Sect. 3.4.

3.2 System influence factors

Hardware Influence Factors: Unfortunately, current VR technology is riddled with ergonomic issues. For example, a greater size/weight of VR HMDs may be distracting and uncomfortable to some users and increase the overall physical workload required to interact with the system [31]. As a result of their limitations in terms of adjustability, certain commercial headset designs are not adapted to suit the dimensions of a significant percentage of the population (see Footnote 1). Individuals who use visual correction aids are even more likely to struggle with adjusting the headset to suit their needs [32], especially in the case of a system shared by multiple users, as is the case with QoE user studies. Additionally, original versions of contemporary commercial VR headsets have been tethered to the PC and dependent on external sensors, which entails various issues with setup, tracking [33], and cumbersome cables [34]. However, as of late, standalone versions have been appearing on the market (e.g., Oculus Go, Oculus Quest), offering greater mobility and easier setup at the expense of computing power. VR hardware manufacturers are starting to integrate eye-tracking technology into their HMDs, a feature that can be used not only as an assessment tool (e.g., [35, 36]), but also as a tool for optimizing user experience and service performance by enabling foveated rendering [37, 38] and fine-tuning user interaction with the VE (e.g., [39]). In general, input modalities (e.g., controllers, gesture control, movement tracking) and output modalities (e.g., headsets, haptic devices) play a significant role in user experience by greatly affecting different quality features. Thus, it is important to pay attention to the possible impact of different device characteristics, such as tracking quality (e.g., [33, 40]), latency (e.g., [41,42,43]), display quality (e.g., [44, 45]), and ergonomic design/fit (e.g., [45, 46]).

Network Influence Factors: Exploring the impact of networking factors (delay, jitter, bandwidth, packet loss) is currently especially crucial for VR applications centered around 360-degree video streaming (e.g., [47]), although networking issues may also cause significant issues for locally-rendered interactive networked VR applications (e.g., multiplayer games [48], teleoperation [49], or telepresence/collaboration applications). However, 5G and beyond networks are expected to be a disrupting force, revolutionizing the capabilities of immersive interactive VR as we know it. In addition to enabling split rendering, through significant improvements in network bandwidth, latency, and reliability, 5G and beyond networks provide the means for achieving hyper-realistic holographic telepresence. While VR in its current state mostly relies on audio-visual stimuli and body movement tracking to produce a high level of immersion, the significance of haptic technology is expected to increase with the emergence of 5G-enabled Tactile Internet (TI) [50].

Media/Coding Influence Factors: This group includes factors related to compression approaches used for encoding audio and video data, as well as other relevant types of information—e.g., point clouds. Aimed at facilitating efficient storage and network transmission, the factors discussed in this paragraph are generally more relevant in the context of 360-degree video (e.g., [51]) and cloud VR (e.g., [52]), compared to synthetic, locally rendered VR services, and are therefore mostly out of scope for this paper. Because of this, we will only briefly touch upon useful sources that may be of interest to readers. For example, Xu et al. [53] present a state-of-the-art overview of 360-degree video and image processing, which includes relevant information regarding perception, quality assessment, and compression methods. With respect to standardization efforts, ITU-T Recomm. P.919 [54] outlines subjective assessment methods for evaluating the QoE of short 360-degree videos. Details are provided on the characteristics of source sequences to be used, with a wide range of stimuli covering different spatial and temporal complexity, motion, and exploratory properties. Interested readers are further referred to the cross-lab quality assessment tests involving 360-degree videos reported by Gutierrez et al. [55], which were instrumental for the development of ITU-T Recomm. P.919. Among analyzed factors impacting audiovisual quality, the authors consider source content characteristics and uniform and non-uniform coding degradations. In terms of audio, interactive VR services require special consideration of different user movements and positions in relation to other sound sources and listeners positioned within the surrounding virtual space. A paper by Narbutt et al. [56] delves into spatial audio compression and its impact on subjectively-perceived quality. 
Readers interested in coded representations of immersive media, including not only immersive audio and 360-degree video, but also volumetric data (as discussed in [57, 58]), may refer to the ISO/IEC 23090 MPEG-I collection of standards (see Footnote 2), which contains information on relevant formats, compression methods, quality metrics, implementation guidelines, and reference software.

Content Influence Factors: It is important to take into account different characteristics of the application used in a particular QoE study. In the case of interactive applications, such as games, different genres/types can exhibit different levels of sensitivity to different kinds of impairment, such as latency, or produce different levels of immersion and discomfort. Even within the same genre/type, different applications may utilize different mechanics and interaction patterns, realized using different software implementations. These differences need to be taken into consideration, as they may influence QoE and lead to different conclusions. Notable examples of aspects that are of interest to VR researchers include different characteristics of the avatar (e.g., [59, 60]) and the visual environment (e.g., [61]), implementation of the locomotion method (e.g., [62, 63]), narrative (e.g., [64, 65]), UI design (e.g., [66, 67]), etc. With regard to VR gaming, due to similarities with other virtual environments, many relevant content influence factors can be found in the ITU-T Recomm. G.1032 [68], which describes influence factors affecting gaming QoE.

3.3 Context influence factors

Following the discussion of content IFs, different ‘tasks’ performed by end users when evaluating QoE during VR use may be relevant to consider, such as tasks involving different interaction or locomotion techniques. Further, the actual social context is a relevant factor in case of multiplayer/collaboration applications. Arguably, it may be even more relevant for immersive applications compared to conventional platforms. In fact, in addition to an increase in perceived immersion [29, 69], VR multiplayer games may result in higher levels of empathy in users when compared to non-VR [69]. User experience may greatly differ depending on the duration and/or frequency of VR use, which impacts the formation of QoE, with the temporal development of QoE, including momentary, reflective, repetitive, and retrospective QoE, explained in further detail in Sect. 7. Physical environment may not be visible to the user immersed in a VE, but environmental variables may be distracting or facilitate the occurrence of cybersickness. Internal and external validity of the results are significantly affected by the setting of the study, i.e., whether it is situated in the field, or in a lab. A more detailed analysis on the impact of the physical context of the study can be found in Sect. 8.
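When planning a study, the influence factors discussed in Sects. 3.1 to 3.3 can be recorded in a simple lookup structure so that each measured or controlled variable is tagged with its category. The following minimal sketch shows one possible layout. The names `IF_TAXONOMY` and `categorize`, the sub-grouping labels, and the selection of example factors are our own illustrative assumptions drawn from the text above; they do not reproduce the exact structure of ITU-T Recomm. G.1035 [28].

```python
from typing import Dict, List, Optional, Tuple

# Illustrative grouping of influence factors named in Sects. 3.1-3.3
# (sub-group labels are assumptions, not taken from ITU-T G.1035)
IF_TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "human": {
        "static": ["age", "sex"],
        "dynamic": ["affective state"],
        "other": ["history of illness", "vision and hearing", "previous experience"],
    },
    "system": {
        "hardware": ["tracking quality", "latency", "display quality", "ergonomic design/fit"],
        "network": ["delay", "jitter", "bandwidth", "packet loss"],
        "media/coding": ["audio/video compression", "point cloud compression"],
        "content": ["genre/type", "avatar", "visual environment",
                    "locomotion method", "narrative", "UI design"],
    },
    "context": {
        "task": ["interaction technique", "locomotion technique"],
        "social": ["multiplayer/collaboration"],
        "temporal": ["duration of use", "frequency of use"],
        "physical": ["study setting (lab or field)", "environmental variables"],
    },
}

def categorize(factor: str) -> Optional[Tuple[str, str]]:
    """Return (category, sub-group) for a known factor, or None if it is not listed."""
    for category, groups in IF_TAXONOMY.items():
        for group, factors in groups.items():
            if factor in factors:
                return category, group
    return None

print(categorize("jitter"))
print(categorize("avatar"))
```

Keeping such a mapping alongside the study protocol makes it easier to verify that the manipulated and controlled variables cover the intended categories.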

3.4 QoE features

Fig. 3 Quality features of gaming QoE (taken from [70]; based on [71])

A quality feature is defined as “a perceivable, recognized and nameable characteristic of the individual’s experience of a service which contributes to its quality” [72]. Generally speaking, as described in [5, 73], quality features can be classified on several levels: level of direct perception (e.g., brightness, contrast, flicker, color perception, loudness, sound localization), level of action (e.g., immersion, perception of space, perception of one’s own movements/motion within that space), level of interaction (e.g., responsiveness, naturalness of interaction), level of the usage instance of the system (e.g., learnability, intuitivity, ease of use, aesthetics), and level of service beyond the particular usage instance (e.g., appeal, usefulness, utility, acceptability). In the context of VR as an interactive, immersive, multi-modal medium, all examples mentioned above can be considered relevant features, but the extent of their individual contributions towards the overall QoE may vary depending on the particular type of VR service.

For example, Fig. 3 displays a taxonomy of gaming QoE features, as presented in ITU-T Recomm. P.809 [70], and based on Möller et al. [71]. However, while certainly transferable to VR, the taxonomy given in Fig. 3 involves some features that may not be relevant to non-gaming interactive VR applications (e.g., tension, challenge). Additionally, it does not include one of the most distinguishing characteristics of the platform—outside of depicted aspects, evaluating VR QoE/UX often includes examining dimensions such as discomfort and cybersickness, which happen as a result of the more physically intrusive nature of the platform, and may significantly degrade user experience. Indeed, aspects such as fatigue and discomfort have previously been recognized as some of the main features of QoE for certain media (i.e., 3D-TV [74]). In line with this, we would like to highlight the need for a general high-level taxonomy (or multiple service-specific taxonomies) of QoE features pertaining specifically to interactive VR and incorporating these aspects. Choosing to further focus on features that may be of a particular relevance in the context of VR (in comparison to less immersive media), in the remaining part of this section we present a more in-depth overview of immersion and presence, while an overview of physical symptoms is presented in Sect. 3.5.

3.4.1 Presence, immersion, and related concepts

When discussing user experience related to technologies such as AR and VR, it is important to define presence and immersion. Schuemie et al. [75] observed that, in literature, the term presence generally refers to a self-reported feeling of being transported to a virtual environment (i.e., experiencing a sensation of “being there”). As explained by Slater and Usoh [76], presence in the virtual world is the main factor that is specific to VR when compared to different types of media. The authors suggest that presence should be achieved through visual, auditory, tactile, and haptic sensations experienced by the subject.

Lee [77] defines presence as “a psychological state in which virtual objects are experienced as actual objects in either sensory or nonsensory ways”, and describes three types of presence. Physical presence refers to a state in which the subject experiences virtual physical objects as if they were actual physical objects. Self-presence refers to a state in which the subject experiences their virtual self (or virtual selves) as if it/they were the actual self. Social presence refers to a state in which the subject experiences virtual social actors (i.e., other humans and/or human-like intelligences) as if they were actual social actors. The definition of social presence encompasses situations that include both one-way and two-way communication, which distinguishes it from the definition of co-presence (i.e., the feeling of being present in a virtual space along with other humans, pertaining to social interactions with a mutual awareness; [78]), which does not include one-way communication.

With respect to immersion, multiple definitions have been proposed in the context of immersive technologies. Witmer and Singer [79] offer a definition of immersion as a psychological state in which a person perceives themselves as being inside of a virtual environment and interacting with it. Slater and Wilbur [80] take a different approach as they describe immersion in terms of hardware—more specifically, its ability to provide an experience of artificial reality that can be described as inclusive (referring to the hardware’s ability to block out physical reality), extensive (referring to the extent of independent sensory systems, such as sight, hearing etc., engaged by the hardware), surrounding (referring to the field of view), and vivid (referring to device characteristics such as display quality, resolution, and fidelity).

Cummings and Bailenson [81] performed a meta-analysis based on 83 studies, investigating the relationship between immersion (as a technical quality) and presence. While immersion had a moderate overall effect on presence, certain immersive features (tracking level, field of view, stereoscopy) were found to have a larger impact in comparison to other immersive features, such as image quality, resolution, and sound. These results highlight the importance of spatial cues and self-locating in the presence formation process [82, 83], compared to features such as realism and level of detail.

Several distinct terms and concepts are often considered when discussing presence and immersion (see [84]). For example, Witmer and Singer [79] provide the following definition of involvement: “a psychological state experienced as a consequence of focusing one’s energy and attention on a coherent set of stimuli or meaningfully related activities and events”. Weibel et al. [85] define absorption as “the capability to concentrate and block out external and distracting stimuli”, and consider it to be one of two independent subdimensions of immersion (the other being emotional involvement).

Slater and Sanchez-Vives [86] use the term embodiment to refer to a setup in which the virtual body coincides with the physical body of a user, the user sees the world from the perspective of the virtual body, and there are different types of synchronous multisensory correlation between the two. The visual characteristics of the virtual body (i.e., the avatar) significantly affect the user’s experience of the virtual environment. Compared to a generic avatar, embodying a personalized avatar was found to increase the sense of body ownership, as well as perceived presence [87]. Even the user’s behaviour, motor functions, and attitude have been shown to change in accordance with the visual characteristics of the corresponding virtual body. The explanations of this phenomenon are given in [88] (the Proteus effect) and [86] (body semantics).

In our brief summary of the aforementioned concepts, we have touched upon certain challenges in terms of wording and nomenclature. For example, although seemingly interchangeable, certain terms (e.g., social presence and co-presence) actually differ from one another in more or less subtle ways, while other terms have multiple (very different) definitions (e.g., immersion). Additionally, due to their relatively abstract, sometimes even vague, definitions, participants may struggle with reporting their subjective perception/evaluation of such features. Researchers are therefore advised to consider these issues when designing a study or comparing other work.

3.5 Physical side-effects

In addition to the previously mentioned QoE influence factors and features, the overall VR QoE is highly dependent on the level of physical discomfort experienced by the user. Physical side effects are common among participants and arise from a combination of factors, including inherent characteristics of the human perceptual system, static human factors such as age or sex, and technical issues related to application and system design [11].

3.5.1 Cybersickness—definition and symptomology

Immersive technology users commonly experience a state known as cybersickness, which is often likened to motion sickness. Symptoms of motion sickness include emesis (nausea, retching, vomiting), different oculomotor disturbances (e.g., eye-strain, blurred vision), postural instability (also called ataxia) and vertigo [89]. The main distinction between motion sickness and cybersickness is the type of stimulation they tend to be induced by. The main cause of motion sickness is vestibular stimulation [90] (however, visual stimulation may also contribute [91]), while cybersickness can be provoked by visual stimulation alone. Aukstakalnis [11] provides a comparison between the two in terms of symptomatology. Both motion sickness and cybersickness may cause pallor, nausea, retching/vomiting, increased salivation, increased sweating, dizziness and headaches. In addition to the aforementioned symptoms, Aukstakalnis [11] lists fatigue as a common symptom of motion sickness, and apathy, disorientation, difficulty focusing and blurred vision as common symptoms of cybersickness. However, Mazloumi Gavgani et al. [92] conducted a study comparing symptoms of motion sickness caused by physical movement to symptoms of cybersickness caused by an immersive VR application, and found similarities between symptoms and autonomic changes induced by both types of stimulation, leading to the conclusion that motion sickness and cybersickness are clinically identical.

In addition to discussing the differences between “cybersickness” and “motion sickness”, it is important to address the relationship between the terms “cybersickness” and “simulator sickness”. While they are often used interchangeably (or replaced by less common terms such as “virtual reality sickness” or “visually induced motion sickness”, e.g., [11]), and usually examined using the Simulator Sickness Questionnaire (SSQ [93]; see Sect. 6), they differ in terms of context and symptomatology. While the term “simulator sickness” originally refers to the type of discomfort experienced during use of military simulators, cybersickness comes as a result of exposure to VEs. Stanney et al. [94] explain that for simulator sickness, oculomotor symptoms are the most pronounced, followed by nausea and disorientation, while cybersickness results in comparatively higher levels of disorientation, followed by nausea, with oculomotor symptoms being the least prominent symptom group. Additionally, as measured by the SSQ, sickness induced by virtual environment systems results in significantly higher intensity for all three symptom groups compared to simulator sickness [94]. For the sake of consistency, we refer to the state of VR-induced discomfort as “cybersickness” throughout this survey paper, regardless of the exact term used in the cited research.
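To make the three SSQ symptom groups more concrete, the following Python sketch computes weighted subscale scores from raw subscale sums and orders the groups by prominence. The weights (9.54, 7.58 and 13.92 for the subscales, 3.74 for the total) are the values commonly reported for the original SSQ [93]; they are not restated in this section and should be verified against the questionnaire source. The function names and the example raw sums are hypothetical.

```python
# Commonly cited SSQ scaling weights (assumption: verify against [93]).
SSQ_WEIGHTS = {"nausea": 9.54, "oculomotor": 7.58, "disorientation": 13.92}
SSQ_TOTAL_WEIGHT = 3.74


def ssq_scores(raw_sums):
    """Return weighted subscale scores and a total score.

    raw_sums maps each subscale name to the sum of its item ratings.
    """
    scores = {name: raw_sums[name] * weight for name, weight in SSQ_WEIGHTS.items()}
    scores["total"] = sum(raw_sums[name] for name in SSQ_WEIGHTS) * SSQ_TOTAL_WEIGHT
    return scores


def symptom_profile(scores):
    """Order the three symptom groups from most to least pronounced."""
    return sorted(SSQ_WEIGHTS, key=lambda name: scores[name], reverse=True)


# Hypothetical raw sums for one participant after a VR session.
example = ssq_scores({"nausea": 3, "oculomotor": 2, "disorientation": 4})
profile = symptom_profile(example)
print(example, profile)
```

With these illustrative inputs, the resulting ordering places disorientation first and oculomotor symptoms last, which corresponds to the cybersickness profile described by Stanney et al. [94]; a simulator sickness profile would show the reverse ordering.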

3.5.2 Factors contributing to cybersickness

Physiological factors LaViola Jr [95] lists three popular theories explaining physiological factors behind cybersickness:

  • Sensory Conflict Theory states that the main reason behind motion sickness, as well as cybersickness, is the conflict between the vestibular sense and the visual sense [96]; in case of cybersickness this conflict happens when a person perceives movement based on the information on the display, but their body is not actually moving in a way that is suggested by the visual stimulus.

  • Postural Instability Theory states that cybersickness is caused by an application/service/system forcing the user into a prolonged state of postural instability, meaning that they experience a state of “uncontrolled movements of the perception and action systems” which is not adequately minimized [97].

  • Poison Theory states that cybersickness is caused by an evolutionary mechanism which serves as protection against poisoning; a mismatch between different sensory input systems that happens during immersive application use is incorrectly interpreted by the brain as a symptom of poisoning, which triggers an emetic response in order to empty the stomach of toxic substances [98].

System factors An exploration of the impact of VR hardware maturity on cybersickness is presented in [99]. Specific factors that may contribute to the occurrence of cybersickness are presented below, including factors listed by Aukstakalnis [11], Stanney et al. [15] and LaViola Jr [95]:

  • latency: the term latency refers to the delay that happens between an action performed by the user and the system’s subsequent reaction [100]; latency tends to cause a mismatch between what the user sees, and the proprioceptive sensations the user feels, therefore causing a sensory conflict which may lead to cybersickness.

  • incorrect interpupillary distance settings: if the lens of the HMD is not properly aligned with the eye, this may trigger the onset of cybersickness symptoms, especially eye-strain and headache [101].

  • optical distortion of scene geometry: to counteract the phenomenon of pincushion distortion caused by the optical design of the HMD lens, the image needs to be distorted in a way that is equal and opposite to the lens distortion (i.e., barrel distortion); however, this often does not compensate for different eye-lens alignments and subtle changes in eye position, which can lead to issues with depth perception and slight shifts in the perceived position of scene geometry [102].

  • flicker/frame rate: low frame rate increases the likelihood of flickering, which may cause issues such as eye-strain and nausea [103], although this depends on the user’s individual critical flicker fusion rate threshold [104].

  • position tracking errors: in addition to standard tracking errors, trackers used in VR systems may produce a jitter effect, i.e., they might move uncontrollably even if the user’s body part remains stationary; this is especially problematic in case of head movements as it shifts the perspective of the user; tracking errors may cause vertigo and difficulty focusing [105].

  • field of view: a wider field of view, while positively contributing to the sense of presence [106], makes flicker more noticeable [107] and increases cybersickness [108] due to the sensitivity of the peripheral visual system.

  • scene complexity: complex environments were shown to produce a significant increase in emetic response [15].

  • implementation of locomotion and camera movement: vection (perceived self-motion [96]), and especially changes in vection [109], increase cybersickness, while increasing the level of user control over body/camera movement reduces cybersickness [110, 111].
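The pre-distortion idea from the optical distortion item above can be illustrated with a one-coefficient radial model. In this sketch, the lens is assumed to map a normalized radius r to r(1 + k r^2) with k > 0 (pincushion), and the renderer applies the numerically computed inverse (a barrel-type warp) so that the two cancel. The model, the coefficient value, and the function names are illustrative assumptions rather than the correction used by any particular HMD.

```python
def lens_pincushion(r, k):
    """Assumed lens model: radius r is magnified to r * (1 + k * r**2), k > 0."""
    return r * (1.0 + k * r * r)


def predistort_barrel(r, k, iterations=20):
    """Find the radius p whose image through the lens equals r.

    Solves p * (1 + k * p**2) = r with Newton's method, starting from p = r.
    """
    p = r
    for _ in range(iterations):
        residual = p * (1.0 + k * p * p) - r
        slope = 1.0 + 3.0 * k * p * p
        p -= residual / slope
    return p


K_ASSUMED = 0.25  # hypothetical distortion coefficient
for radius in (0.0, 0.5, 1.0):
    warped = predistort_barrel(radius, K_ASSUMED)
    restored = lens_pincushion(warped, K_ASSUMED)
    print(f"target={radius:.2f} pre-distorted={warped:.4f} after lens={restored:.4f}")
```

Passing a different coefficient to `lens_pincushion` than to `predistort_barrel` leaves a residual error, which serves as a simplified stand-in for the eye-lens alignment problem: the correction is only exact for the optical configuration it was computed for.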

Human factors Various sources (e.g., [11, 15, 46, 95, 112,113,114]) list some of the individual factors that may be linked to a greater susceptibility to cybersickness, such as:

  • age: cybersickness susceptibility is highest for children between the ages of 2 and 12 [115]; following this early period, it decreases between the ages of 12 and 21 [96] and increases again after age 50 [116].

  • sex: female users have been found to be more prone to cybersickness [105, 113, 117].

  • ethnicity/race: Asian people have been found to be more prone to cybersickness [118].

  • bodily traits and history of illness: higher body mass index [15], previous experiences with motion sickness and cybersickness [119], migraine propensity [120], etc.

  • behavioral conditions and current state/mood: inadequate sleep [121], alcohol intake [121, 122], acute infections [121, 122], being made aware of/thinking about cybersickness [123], strong affective response to stimuli [114].

  • psychological traits and personality type: neuroticism [114], anxiety [124], low self-efficacy towards technology [113], low perceived sense of direction [113], lower preference towards adrenaline sports [99].

3.5.3 Adaptation

Wang and Suh [125] discuss different types of adaptation mechanisms (behavioral, cognitive and physiological adaptation) used for counteracting cybersickness. When users experience cybersickness, they tend to perform certain actions as a way to mitigate their symptoms, which is referred to as behavioral adaptation. For example, these actions may include taking a break, moving in a different way or adjusting the headset. Cognitive adaptation refers to the user’s choice to withstand the symptoms of cybersickness because they consider them to be a normal part of the experience. With continued and repeated use, users seem to acquire a certain level of resistance to cybersickness—this type of adaptation is referred to as physiological adaptation.

3.6 Digital eye strain and ergonomics

While the issue of cybersickness has already been researched and discussed in a large body of work, other types of VR-related discomfort have not yet garnered a lot of attention [126]. This gap in research is examined in a recent paper by Hirzle et al. [127]. When comparing the relevance of three symptom categories (referred to as simulator sickness, digital eye strain, and ergonomics) in an online study conducted on 352 frequent VR users, the authors found that the majority of participants considered simulator sickness to be less relevant compared to both remaining categories.

Focusing on ocular symptoms, there are multiple potential causes for discomfort related to head-mounted displays, such as delay, flicker, resolution, image motion and binocular imperfections, which may be caused by optics (e.g., image blur, shift, rotation), the use of filters (e.g., luminance, color, contrast) or stereoscopic disparity [128]. As explained in [129], visual fatigue in VR is mostly a result of vergence-accommodation conflict (VAC) [130, 131], which occurs in case of a mismatch between the accommodation distance and the rendered image distance, but may also happen because of motion (especially vibrational motion [132], which requires the gaze to be stabilised by the vestibulo-ocular reflex, or VOR).

With regard to other ergonomic issues of VR systems, we highlight the following examples:

  • heat: heat development on the VR HMD may cause significant discomfort [45] and increase sweating [127]; the uncomfortable feeling of increased body heat may also come as a result of physical activity required by some types of VR applications;

  • weight: as a result of wearing an HMD, VR users tend to modify their posture, thus stressing the musculoskeletal system (especially head and neck areas [133]); aside from the total weight of the HMD, another aspect that affects comfort is its distribution, as an imbalanced HMD design places greater torque around the neck of the user [31, 45];

  • pain and muscle fatigue: the aforementioned neck strain that comes as a result of HMD weight can become exacerbated by frequent head turns as the user looks around the virtual environment, while frequent mid-air arm movements required by certain application mechanics are likely to produce a feeling of heaviness and fatigue in the arms and shoulders, an effect referred to as gorilla arms [134]; when using VR applications that require users to squat or bend down, users may experience fatigue and/or pain in different areas of the back and legs; the issue of muscle strain is especially relevant for exergaming solutions, especially if they require the use of additional exercise equipment;

  • adjustability issues: commercial HMD designs incorporate wheels and straps for a more precise adjustment of headset fit and interpupillary distance; however, due to the headset weight and other design factors, VR users often struggle with finding the right balance between too tight and too loose, both of which can be uncomfortable for the skin or different regions of the face and head [46]. For example, based on our experience in conducting VR studies involving numerous users using various commercially available HMDs, in cases where the HMD is not properly fastened, users need to take frequent pauses for readjusting the setup, thus possibly breaking the immersion, while excessive tightness or friction between the HMD and the skin may leave the user with lingering facial redness and headset lines; the problems associated with headset adjustability are likely to be even more pronounced for users with refractive errors, especially in case they need to use their prescription eye wear underneath the headset [46]; HMD owners may find a solution in purchasing prescription lens inserts, but the issue remains for researchers conducting studies on multiple subjects using the same HMD.

3.7 Cognitive effects

In addition to potentially triggering different types of discomfort, the use of VR may affect certain cognitive processes. The impact of VR use on reaction time, mental rotational activity, perceptual speed and visual working memory was examined in a paper by Mittelstaedt et al. [135]. Immersion in VR did not result in declined performance for perceptual speed and visual working memory, and even produced an improvement in the processing speed for mental rotation. However, the authors report an increase in reaction time after VR exposure—an effect that has also been observed in [136, 137], although its etiology is not yet fully understood. While results presented in [136, 137] provide a link between the slowing down of reaction time and cybersickness, Mittelstaedt et al. [135] presented other possible explanations including visuomotor adaptation due to sensory mismatch and temporal adaptation to the slight delays introduced by the used devices. Szpak et al. [138] observed the impact of VR exergaming on decision time and movement speed during a multiple choice reaction time test. The authors did not find a significant effect for decision time, while motor movement time improved after 10 min of VR exposure, although the effect did not linger for long. This improvement in reaction time may be explained as a result of the physical activity required by the game.

While it is unclear whether the effect of VR on reaction time is significant enough to raise concern regarding the dangers of operating cars and heavy machinery immediately post-VR exposure, it is important to note that the average increase remains below 50 ms across different studies. Additionally, this effect may only be short-lived, and therefore easily mitigated by incorporating a short (e.g., 40 min [138]) wait period before attempting to perform any potentially hazardous activities. However, the value of measuring cognitive performance in relation to VR use extends beyond the implications regarding user safety—e.g., cognitive performance measures may provide a better understanding of cognitive processes required for adequate functioning in an artificial environment, or serve as a benchmark for assessing cognitive fatigue and quality/naturalness of different interaction mechanics and input devices.

4 The importance of pre-screening and participant choice

Gravetter and Forzano [139] state that external validity of a user study in the field of behavioral sciences refers to “the extent to which we can generalize the results of a research study to people, settings, times, measures and characteristics other than those used in that study”. Therefore, when performing QoE assessment, researchers should aim to formulate their research methodology in a way that would enable the study to be reproduced with similar results. Additionally, to achieve valid results, study conditions should be adapted to resemble a real-world usage scenario. An important component to achieving a high level of external validity is the extent to which we can generalize the results of a study from a sample population to the general public. Unfortunately, with regards to VR, this proves to be a significant challenge.

4.1 Experience and preconceptions

The so-called halo effect [140, 141] happens when judgements about unknown characteristics of an entity (e.g., person, object, system) are made based on its evident and/or previously known characteristics. At the time of writing, VR is still widely considered to be a niche type of technology, owned by a small number of early adopters (e.g., gaming enthusiasts). Due to the relative scarcity of VR systems, and their reputation for creating a more immersive experience in comparison to virtually any other platform, the possibility of participating in a VR research study tends to arouse interest in potential subjects, especially those of a younger demographic. This preconceived enthusiasm might influence study results by clouding the participants’ judgement of the system or the application.

On a related note, the novelty effect happens when test subjects’ perception and responses in a research study setting (which is considered to be a novel situation) deviate from their perception and responses in a real-world situation [139]. With VR, the users may also be influenced by the perceived novelty of the platform itself. While Fairchild et al. [142] noted that novice users may experience VR in a negative way, Hupont et al. [29] attribute positive affective states experienced by test subjects to the novelty of the VR platform. Whether positive or negative, the potential impact of platform novelty on user experience should be considered when choosing test subjects and/or interpreting study results.

Considering study results serve as input for creating QoE models and guidelines to be used by hardware manufacturers, content developers, and network service providers, we believe that researchers should try to avoid relying on conventional convenience sampling, which is likely to result in a very inexperienced sample of participants due to the aforementioned issue of relative scarcity of VR systems. Instead, assuming the necessity of nonprobability sampling approaches, researchers could lean towards quota sampling as a way to achieve a more balanced distribution of users with different levels of experience. If advanced users are not available for study participation, researchers could look into participant recruitment via crowdsourcing platforms [143]. Alternatively, some of the less experienced subjects may be given several training sessions prior to the actual test session as a way to mitigate the impact of aforementioned psychological effects (e.g., cybersickness adaptation training, as discussed in [144]).
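As a rough illustration of the quota sampling approach suggested above, the following Python sketch draws participants at random within each experience level until a per-level quota is filled, and reports any shortfall. The experience categories, quota sizes, participant identifiers, and function name are hypothetical and would need to be adapted to the study at hand.

```python
import random
from collections import defaultdict


def quota_sample(candidates, quotas, seed=0):
    """Pick participants at random within each experience level until its quota is met.

    candidates: list of (participant_id, experience_level) tuples.
    quotas: dict mapping experience_level to the number of participants wanted.
    Returns the selected ids and any per-level shortfall.
    """
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for participant_id, level in candidates:
        by_level[level].append(participant_id)

    selected, shortfall = [], {}
    for level, wanted in quotas.items():
        pool = by_level[level]
        take = min(wanted, len(pool))
        selected.extend(rng.sample(pool, take))
        if take < wanted:
            shortfall[level] = wanted - take
    return selected, shortfall


# Hypothetical pool dominated by novices, as convenience sampling tends to produce.
pool = [(f"P{i:02d}", "novice") for i in range(12)]
pool += [(f"P{i:02d}", "intermediate") for i in range(12, 17)]
pool += [(f"P{i:02d}", "advanced") for i in range(17, 19)]

chosen, missing = quota_sample(pool, {"novice": 4, "intermediate": 4, "advanced": 4})
print(chosen, missing)
```

A non-empty shortfall indicates the experience levels for which additional recruitment (e.g., via crowdsourcing) or prior training sessions may be worth considering.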

4.2 Ethics, health and safety

Madary and Metzinger [145] highlight the importance of pre-screening as a way to remain in compliance with the principle of non-maleficence, which instructs researchers to construct their experiments in a way that ensures no significant or long-term harm would come to subjects as a consequence of participating in the study. The authors especially warn about the well-being of participants with psychiatric disorders (whether diagnosed or undiagnosed). It is important to stress, though, that VR is often used in treatment of certain psychiatric disorders, in which case pre-screening should be employed with the purpose of finding participants with that particular disorder. However, even in those cases, researchers should remain mindful of the possible psychological impact, and exclude participants whose psychological vulnerabilities or other conditions put them at risk. Therefore, depending on the aim of the study, researchers should define appropriate exclusion criteria by using specialized questionnaires to assess whether the user has previously exhibited or currently exhibits signs and symptoms (e.g., dissociative experiences, psychotic episodes, suicidal ideation) of certain disorders that may get aggravated by the experience. Behr et al. [146] suggest screening participants for space-related phobias (e.g., claustrophobia, agoraphobia), as well as other phobias specifically related to the test material.

Lewis and Griffin [147] offer suggestions for screening participants prior to the clinical use of VR. They advise against including participants who are ill with diseases such as influenza, ear infections or ocular defects, suffering from balance disorders and/or taking medication that affects visual or vestibular function, currently under the influence of alcohol, or prone to motion sickness or cybersickness. These pre-screening guidelines may also be utilized for non-clinical studies involving VR exposure. In general, it is often advised that people who show high levels of sensitivity to cybersickness should not be exposed to VR [146] even in a research setting. However, from the perspective of product developers, including more vulnerable participants allows for a deeper level of insight, which can then be utilized to improve the application or system. Therefore, an alternative approach [148] is to use questionnaires to purposefully select subjects who have previously shown signs of cybersickness or motion sickness, as well as inexperienced and elderly users, who might be more sensitive, while also choosing to invite a larger number of participants in case sensitive users have to terminate the experiment due to the onset of symptoms. In terms of visual impairment and ocular symptoms, participants may be excluded based on their scores on visual acuity, color-blindness, or stereopsis tests.
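The suggestion to invite a larger number of participants can be turned into a simple planning calculation. The sketch below assumes that a fixed share of invited participants will terminate the session early and returns the smallest number of invitations for which the expected number of completed sessions reaches the target. The dropout share and target size are hypothetical planning values, not figures taken from the cited studies.

```python
import math
from fractions import Fraction


def participants_to_invite(target_completed, expected_dropout):
    """Smallest number of invitations so that the expected completers reach the target.

    expected_dropout is the assumed share of participants who stop early (0 <= x < 1).
    Fractions avoid floating-point rounding when the result is a whole number.
    """
    dropout = Fraction(str(expected_dropout))
    if not 0 <= dropout < 1:
        raise ValueError("expected_dropout must be in [0, 1)")
    return math.ceil(target_completed / (1 - dropout))


# Hypothetical planning values: 30 completed sessions, 20% expected early terminations.
invites = participants_to_invite(30, 0.2)
print(invites)
```

In practice, the dropout share would ideally be estimated from a pilot session or from earlier studies using a comparable application and participant group.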

Researchers exploring interactive VR may benefit from examples and guidelines regarding the inclusion of users with disabilities in gaming user research, presented in [148]. For example, study administrators should be on the lookout for situations that might make participants feel frustrated and vulnerable, such as not being able to successfully perform the required activity, being tested in a group setting or involved in a multi-player game. In order to adapt the process to the specific needs of each participant, researchers may need to consult with medical experts, therapists, and/or caretakers, as well as with individual participants if necessary. In general, it is very important to keep the whole process of testing as flexible and adaptable as possible.

4.3 Diversifying the study population

According to research, sex plays a significant role in evaluation of vitally important elements of the VR experience, i.e., the perception of presence [149] and susceptibility to cybersickness [117], with researchers suggesting that VR technology tends to be more adapted to male users. However, to fully understand these implications, further comparisons need to be drawn based on experimental data. To be able to compare sex differences, scientists should strive towards achieving a balanced sex distribution of study participants, while also considering gender differences in experience with VEs, such as games.

VR systems and applications are mostly geared towards a younger, tech-savvy audience. On top of that, college students are over-represented in user studies regarding human psychology and behaviour [150]. While quantifying the percentage of college students participating in VR user studies is beyond the scope of this paper, we are under the impression that recruiting this demographic is a common practice among VR researchers. Unfortunately, this also means that their findings are not necessarily applicable to a wider range of VR users, as differences in VR user experience between users of various age/age groups have already been noted by researchers (e.g., [151, 152]). For example, the perceived ease of use with VR technology [26] may differ based on age/age group, and age may play a certain role in the susceptibility to cybersickness (as mentioned in Sect. 3), illusion of body ownership [153], as well as immersion and presence [154]. The oversampling of the young adult demographic can be prevented/counteracted by incorporating participants of different ages into the study population, or designing studies specifically for underrepresented age groups, such as children, or the elderly. However, incorporating these age groups may require special consideration.

Researchers (as well as VR system manufacturers and VR application developers) warn about the unknown impact of VR use on children and young adults with regards to their psychological and neurophysiological development [145]. However, Tychsen and Foeller [155] conducted a user study on 50 children (aged between 3 and 10 years old) and reported that 94% of participants experienced no significant differences in postural stability, as well as no observable symptoms of dizziness and cybersickness, while playing a flying VR game. Additionally, measuring horizontal VOR in a small subset of participants (5 children)—before and after VR use—yielded no evidence of vestibulo-ocular maladaptation. However, subjective scores for cybersickness, dizziness, eye strain, and head/neck discomfort post-VR were higher compared to baseline, although the authors note that the observed difference is only statistically, but not biologically, significant. While there is a need for further research (especially longitudinal studies) in this area, according to this study, limited VR exposure (such as participating in a short study session) is not likely to cause long-term psychological or physiological damage to young participants if appropriate precautions are taken (e.g., limiting the session duration, repeatedly reminding participants to terminate the experiment if they experience fear or discomfort, exposing them to age-appropriate material only).

In terms of elderly users, VR may be used for entertainment, to diagnose and treat conditions such as Alzheimer’s disease, or as an aid in physical therapy, helping users to improve their balance or motor skills. However, when designing studies for this demographic, it is important to adjust the content and/or study methodology to their specific limitations regarding mobility, cognitive abilities, and computer literacy. As with other participants with disabilities, elderly users may not be able to successfully participate in studies that are not designed for them specifically, as they may find it difficult to navigate the application or perform certain physical movements (e.g., turning [156]), which can lead to frustration and lowered confidence [148]. Elderly users may experience difficulty with vision and hearing [156], which can be counteracted by adjusting the volume of the test material, presenting the instructions in a clear, comprehensive way, and repeating them as needed. Furthermore, it is important to keep in mind that older individuals may be at a higher risk of falling during VR experiences [157].

5 Guidelines for preparation of appropriate test material

The test material used for conducting VR user studies depends on the aim of the study, and can range from applications with a practical purpose, such as those intended for therapeutic use (e.g., physical therapy, cognitive therapy, phobia treatment), educational applications or scientific visualisations (e.g., medical applications, military training), to applications intended for entertainment purposes (e.g., games, drawing in VR). Test material can be developed specifically for the study, or it may be a short sample of an existing application. The latter option is especially appropriate for VR gaming studies. As suggested by ITU-T Recommendation P.809 [70], researchers should carefully select a sample that displays a mechanic that is typical for the game (or another type of application). If using a fixed level of difficulty, researchers should aim to select a sample that is appropriate for participants with various levels of experience. Otherwise, they may choose to keep it adjustable, so that it can be adapted to fit the skill level of each participant. Prior to conducting the actual study, test material should be thoroughly examined to ensure that the application runs smoothly. As discussed in [148], the frustration caused by encountering bugs and crashes during a test session is likely to degrade reported user experience.

Schatz et al. [158] highlighted the deficit of standardized VR content as well as a lack of standardized test tasks that would enable the reproduction of user studies across different laboratories and research groups. An example of such a test task can be found in [33], where the authors use a simple pick-and-place task to compare the performance of different VR systems. While design, development, and distribution of standardized test content remains an open challenge, researchers can facilitate comparison between studies by describing the used application, as well as chosen methods of interaction and locomotion.

5.1 Ethics, health and safety related to choice of test material

As previously discussed, researchers are expected to follow the principle of non-maleficence. Thus, the material chosen or created for the study should not inflict significant or long-term psychological or physical harm.

5.1.1 Avoiding psychological harm

Virtual environments (VEs) differ from other types of media based on two main characteristics [159]: saliency (i.e., VEs provide a more salient/vivid experience by combining multiple sensory stimuli) and agency (i.e., VEs allow the user to interact with their surroundings). The information overload during VR use [146] is a result of high levels of saliency, enabled by the inherent multimodality of VR systems which expose the user to various sensory stimuli (predominantly audio-visual, often haptic) at the same time, combined with the system’s intrusiveness. Unlike hand-held or desktop displays, VR HMDs are strapped onto the user’s head and often equipped with integrated headphones or used with external earphones. This setup is purposefully designed to “override” any audio-visual input from the real world, leading to greater immersion, but making it difficult for users to avoid or escape [146] the artificial sensations they find uncomfortable or overwhelming.

The small lens-to-eye distance in HMD-based VR systems may cause the user to experience the virtual world more concretely compared to other platforms, yet even in CAVE-based VR studies, participants have been shown to respond to a stressful situation with subjective, behavioral and physiological reactions, despite being aware of the artificiality of the presented stimuli [160]. Segovia et al. [161] use HMD-based VR to demonstrate the impact of situations experienced in immersive VR on the moral identity of the user. This connects back to agency as a defining aspect of virtual environments. Agent regret refers to the phenomenon of a person experiencing more guilt after performing an innocent action that led to a negative outcome of a certain situation, than they would have if they merely witnessed the negative outcome without having performed any action at all [159, 162]. A VR application which contains disturbing material may therefore interfere with the user in a more significant way compared to e.g., watching a video based around a similar theme.

Spending a longer period of time in VR may cause issues with discerning between the virtual world and the physical reality, as seen in [163]. While short-term effects, such as experiencing so-called Game Transfer Phenomena (GTP) [164] shortly after exposure to a non-stressful VR application, may not pose a significant threat to psychological and emotional well-being of the participant, the impact of immersion may be increased or prolonged in case of exposure to stressful, scary or otherwise disturbing content. Despite obviously not being real, disturbing media content (e.g., a horror movie) can leave a long-lasting, even lifelong, negative impact on the consumer, resulting in media-induced trauma [165]. However, a study done by Lin [166] showed that lingering effects of a horror game in VR may not be as common or as intense as one might expect, considering only a small number of participants reported experiencing them the day after the study. Despite these findings, it is advisable to avoid exposing users to uncomfortable content unless it is highly relevant for the specific study. In case the test application involves potentially unnerving material, participants should be warned in advance, as well as encouraged to pause or terminate the experiment by taking off the HMD. The test application should include a virtual safe space [167] which allows participants to immediately (i.e., with a button press) escape the anxiety-provoking stimulus without physically removing the VR equipment, as it might be difficult to loosen the straps and take off the headset quickly whilst holding the controllers. Aside from the ethical issue of being exposed to potentially traumatising content, witnessing disturbing events in VR is likely to produce physiological reactions which, if registered by devices such as an EEG or a heart-rate monitor, may complicate subsequent analysis by increasing the ambiguity of the results.

5.1.2 Avoiding discomfort and cybersickness

Table 1 VR application design guidelines and findings for mitigating cybersickness, discomfort, and other health risks

Considering that, even after decades of ongoing research and development, cybersickness in VR still remains a pressing issue for VR scientists, developers, and users alike, there is certainly a need for further research in this area. However, in order to avoid inflicting physical harm while researching the condition or conducting VR user studies in general, researchers should choose or develop test material based on the state-of-the-art knowledge of design factors that might impact the occurrence of cybersickness and other types of discomfort. A compilation of guidelines and useful findings is presented in Table 1.

6 QoE assessment study methodology

In accordance with the principle of respect for persons [187], the autonomy of each study participant has to be respected, which means that researchers have the responsibility to provide relevant information about the study and ask for consent prior to actual data collection. After the consent form is signed, a pre-test questionnaire is given as a way to collect personal information about the participant. Similarly to questionnaires used in gaming research [70], pre-test questionnaires used in VR studies usually encompass questions about the basic demographic data (age, sex/gender, profession, ethnicity), as well as questions about skill level and prior experience. Participants may be asked about their history of illness, or asked to fill out specialized questionnaires as a way to assess their personality traits, or psychological and/or physiological sensitivities. Along with questionnaires, researchers may choose to include standardized visual acuity tests in their pre-testing process. The acquired information aids in later analysis and interpretation of study results, but it can also serve as a basis for exclusion criteria. Participants should be made aware that they are allowed to pause or terminate the experiment at any time. Instructions regarding equipment, material, and assessment methods should be carefully worded, easy to understand, and presented to each participant in an identical way, which helps mitigate instruction bias [148]. Following the instruction phase, participants are equipped with VR and measurement devices, the positioning of which may require some assistance from the administrator. It is highly advisable to sanitize the equipment (headset, handheld controllers, and any other devices that come into contact with the participant) before each session, which is especially relevant in light of the recent COVID-19 pandemic. 
If possible, study administrators should provide each user with a disposable mask that provides a barrier between their skin and the headset. It is advisable to warn participants against operating a vehicle following the exposure to VR content. Although there are no official guidelines at the time of this writing (to the best of our knowledge), the duration of the recommended waiting period will likely depend on the intensity of the application and the duration of the VR exposure [95]: several minutes of exposure to a commercial VR application may require only a short 30–45 min waiting period, while a longer exposure to a flying simulator may require a waiting period of 12 to 24 h. The last step before the actual testing phase is a short tutorial session which facilitates adaptation to the application and the technology. Details regarding temporal or environmental aspects of study design are discussed in Sects. 7 and 8, while the remaining part of this section provides an overview of commonly used assessment methods in VR user research.

6.1 QoE assessment methods

At the time of this writing, there is no standardised methodology for assessing the QoE of VR applications (although efforts are underway in the scope of ITU-T Study Group 12 [188]). However, there are a number of instruments that have been used across various studies addressing the assessment of QoE-related features, such as immersion and presence, as well as side-effects such as cybersickness.

6.1.1 Subjective methods

The use of questionnaires is the most common subjective method used in QoE studies, although it may be supplemented with other methods, such as interviews and diary entries. In most cases, individuals are asked to fill out questionnaires directly related to tested scenarios either during or immediately after testing. Most commonly, users are required to mark their answer on a rating scale. Users may be asked to provide their rating of the overall QoE or its individual dimensions. Instead of using individual questions, researchers often choose to use more established multi-item questionnaires designed to measure a certain aspect (or multiple aspects) of quality. For example, usability can be evaluated using the System Usability Scale (SUS) [189], while the Self-Assessment Manikin (SAM) [190] may be used to assess the user’s affective response.

Certain questionnaires used in QoE research cover a diverse range of features and are intended to be used as a single tool for the evaluation of the overall quality, such as the Game Experience Questionnaire (GEQ) [191, 192] or the Player Experience Inventory (PXI) [193, 194], which are designed for the gaming use-case. Unfortunately, due to the specific characteristics of interactive VR, questionnaires that were initially developed with non-immersive platforms in mind cannot be used on their own, as they do not include certain aspects that are especially relevant to the VR platform, such as discomfort and cybersickness. They need to be combined with other measures, which can be fatiguing for participants and complicates subsequent analysis of results. This highlights the importance of developing questionnaires that can be used for the evaluation of QoE/UX based on specific features that are relevant for interactive VR. An example of a VR questionnaire that evaluates multiple different features (i.e., general user experience, game mechanics, in-game assistance, symptoms and effects induced by VR) is the Virtual Reality Neuroscience Questionnaire (VRNQ) [195], but its use is limited to VR gaming, rather than VR in general. Tcha-Tokey et al. [196] developed a more general-use VR UX questionnaire comprising nine subscales: presence, engagement, immersion, flow, emotion, skill, judgement, experience consequence (which measures symptoms of fatigue and cybersickness), and technology adoption.

The problem with subjective measures is that they are self-reported and therefore cognitively mediated, which leads to distortions and undermines their validity. E.g., participants often tend to avoid either extreme of the scale (central tendency bias), or respond in an excessively positive/agreeable manner (acquiescence bias), while further issues stem from the improper or unclear wording of questions themselves. Lastly, since participants’ view of the real world is obscured by the VR HMD, their answers are often noted by an administrator, which may influence the participant [197]. Therefore, if possible, subjective assessment questionnaires should be integrated into VEs used for testing [198].

6.1.2 Objective methods

In addition to subjective methods, objective methods (physiological, behavioral, and task performance measures) are often used to assess user experience in a less biased way. Physiological methods are based on measuring different physiological signals such as electrocardiography (ECG), electroencephalography (EEG) and galvanic skin response (GSR). Due to their design, certain medical instruments used to collect this data may hinder user experience and degrade QoE scores, so less intrusive devices, such as fitness bands and smart watches, can also be used for collecting physiological signals [199]. As discussed in Skorin-Kapov et al. [200], the use of psychophysiological measurements in assessing user experience improves existing QoE models, especially in terms of user-related factors, and mitigates issues stemming from the use of self-reported assessments [201, 202]. However, it should be noted that it can be challenging to adequately recognize the affective state of the user based on physiological measures only, as different states may be indicated by very similar physiological symptoms [203, 204]—for example, both excitement and stress tend to increase the heart rate of the user. Furthermore, certain methods for measuring physiological signals appear to be sensitive to noise introduced by head movement (e.g., EEG [205]), while others, such as functional magnetic resonance imaging (fMRI), require complete stillness. Therefore, the results of such methods may not be accurate unless the study happens to be consciously designed in a way that aims to keep the user as stationary as possible. Since head movement in VR is not only extremely common, but also highly encouraged through VR application design, the degree to which the results acquired in stationary conditions can be considered representative of realistic VR use is yet to be determined.

Behavioral methods refer to methods that are based on observing and tracking user behaviors, such as physical movement (e.g., “ducking” to dodge an approaching virtual object [206]) and social interaction (e.g., moving away from an avatar or an embodied agent [207]). To assess user preferences or adaptation mechanisms in the context of VR application use, researchers may decide to track and categorize different actions that the user chooses to perform inside of the interactive VE. In addition to larger bodily movements and conscious actions, researchers may choose to observe more subtle behaviours by incorporating methods such as gaze tracking and emotion recognition, made possible by the growing inclusion of eye tracking and facial recognition technology in more recent headsets.

In general, user performance in multimodal interactive systems, such as VR, encompasses three components [208]: perceptual effort, cognitive workload, and physical response effort. Task performance measures (e.g., time to complete task, measures pertaining to spatial and temporal accuracy) aid in quantifying the effort produced to accomplish a task, and may serve as an objective indicator of the impact of different factors on the user’s ability to interact with the service in a successful and efficient way, thus providing an objective measure for the evaluation of QoE features such as ease of use and interaction quality. However, to increase the chances of obtaining conclusive and valid results, it is important to choose tasks and measures that are relevant to the observed system/environment.

6.1.3 Measuring presence and immersion

Table 2 Overview of presence questionnaires (adapted from [75, 84])

Subjective measures: Subjective ratings are commonly collected using questionnaires, with a concise list of questionnaires addressing presence and immersion presented in Table 2. For a more comprehensive analysis of presence questionnaires, the reader may refer to [75, 84], while an overview of studies related to presence and immersion, including information regarding presence questionnaires, is presented in [81]. While the majority of immersion/presence questionnaires are constructed from multiple items, Jennett et al. [219] report that a single-item questionnaire (i.e., “rate how immersed you felt from 1 to 10”) appears to be a reliable measure of immersion as well, which has recently been supported by findings presented in [223].

However, the practice of using questionnaires as a primary method of measuring presence/immersion has been heavily criticised for various reasons. For example, such abstract constructs tend to be loosely defined and therefore open to interpretation, as discussed in [224]. Furthermore, asking users to report their sense of presence/immersion in the middle of the experience that is being evaluated will likely lead to its disruption (as discussed in [225]), and reporting the sense of presence/immersion after the experience has ended relies on potentially inaccurate recollection [226]. Keeping within the context of subjective, self-reported measures, instead of presenting questionnaires during or after a session of VR use, Slater and Steed [225] suggest tracking breaks in presence (BIPs) during exposure to a virtual environment. This method requires users to report transitions from the state of absorption in the virtual environment to the state of being “back to reality”.

Objective measures: Notable behavioural measures for assessing presence are reactions to conflicts between virtual cues and real cues [227, 228] and actions such as reflex responses to virtual events [206]. On a related note, Lepecq et al. [229] propose afforded actions as a way to evaluate presence in VR environments, as users tend to perform behavioral transitions (body rotation) to adapt to the characteristics of the presented virtual environment (narrow virtual aperture) in relation to their own body characteristics (width of shoulders). This is similar to the approach taken by Usoh et al. [216], who observed the path taken by users immersed in a virtual environment to see whether they would choose to step over an unsettling virtual pit or try to walk along its edge as they traverse across the room. Moreover, the aforementioned paper by Usoh et al. describes a combination of different measures (behavioral as well as subjective) that are used to construct the measure of behavioral presence—the degree in which “actual behaviors or internal states and perceptions” suggest a sense of being in the virtual environment instead of the real, physical one.

Given that VR experiences may be able to elicit reactions and emotions that are comparable to those that arise in real-world situations [75], immersive technology has a wide field of application in social science research. Methods that rely on observing and tracking human interaction in multi-user environments are often used to explore constructs such as social presence and co-presence. For example, as presented in [230], task performance metrics can be used as a way to measure social inhibition and facilitation (e.g., [231, 232]) when faced with real or virtual humans, measuring interpersonal distance and personal space (e.g., [233, 234]) is used in the context of proxemics research, and tracking eye gaze and facial expressions can provide information regarding the affective state of the user in a social situation (e.g., [235, 236]).

Presence can also be assessed by examining physiological measures (e.g., GSR, EEG, heart rate, body temperature). Meehan et al. [226] listed subject bias and inaccurate recollection as disadvantages of subjective measures, while also taking note of experimenter bias that may occur with the use of behavioral measures. Thus, the authors reported looking for a measure of presence that meets several criteria: validity in terms of correlation with broadly accepted subjective measures of presence, objectivity, sensitivity to different levels of presence, and reliability/repeatability. Comparing different physiological measures (heart rate, GSR, body temperature), heart rate was found to be the best in meeting the abovementioned criteria, followed by GSR [226, 237]. As opposed to heart rate, GSR did not show promise as a between-user measure.

Bouchard et al. [204] criticise the use of physiological measures of presence, as changes in heart rate and GSR are well-established measures of anxiety, and therefore likely indicate anxiety—rather than presence—in stressful virtual environments. Thus, the authors describe these measures as “at best, proxy measures of presence in anxiety-related contexts” (an example of this can be seen in [61]). Bouchard et al. also argue that exposing the user to both real and virtual situations to see whether they produce similar physiological responses is a much better way of measuring presence than the common approach of measuring changes in physiological signals during various virtual scenarios. An example of an approach based on comparing physiological signals in a virtual situation and a real situation can be seen in [238], as authors use EEG data to confirm previous findings [239, 240] regarding the activity in the parietal lobe and how it relates to the experience of presence.

6.1.4 Measuring cybersickness and VR-related discomfort

Subjective measures: Kellogg et al. [241] developed the Pensacola Motion Sickness Questionnaire (MSQ). Kennedy et al. [93] later developed a condensed version of the MSQ entitled Simulator Sickness Questionnaire (SSQ), which is the most commonly used questionnaire for evaluating cybersickness. However, as VR technology slowly begins to enter the mainstream, researchers are growing more aware of the need to differentiate between motion sickness, simulator sickness, and cybersickness, as discussed in Sect. 3. Although a popular choice among VR researchers, the SSQ may not be an ideal choice for assessing VR-induced discomfort [127, 242]. Therefore, several similar questionnaires have emerged, developed specifically for the VR platform. Ames et al. [101] concluded that the SSQ did not include enough ocular symptoms to be fully appropriate for evaluating immersive environments, and developed a new questionnaire called Virtual Reality Sickness Questionnaire (VRSQ), specifically intended for use with head mounted displays. Another VR-specific SSQ-based questionnaire of the same name and abbreviation was developed thirteen years later, by Kim et al. [243]. In terms of size and questionnaire items, the VRSQ (2018) questionnaire is similar to the CyberSickness Questionnaire (CSQ) [242], also a modification of the SSQ developed for VR. A comparison of symptoms tracked by MSQ, SSQ, CSQ and both VRSQ questionnaires is presented in Table 3.

Questionnaires such as the SSQ and its variations are given after (and sometimes before) a specific VR experience. However, researchers may choose to examine the users’ overall susceptibility to cybersickness, which is usually investigated prior to VR exposure, often using the revised versions [244, 245] of the Motion Sickness Susceptibility Questionnaire (MSSQ) [96, 246]. While the MSSQ (revised) questionnaire (long and short) investigates users’ previous experiences with sickness during exposure to different types of motion, its final version does not include items pertaining to experiences that are primarily associated with visually induced sickness (e.g., virtual reality). However, considering the similarities between motion sickness and cybersickness [92], it is commonly used in VR research. Moreover, it shows a positive correlation with post-VR SSQ scores [247].

Table 3 Comparison of symptoms assessed by the MSQ [241], SSQ [93], VRSQ [101], CSQ [242], and VRSQ [243] questionnaires

The main disadvantage of longer questionnaires is the long time it takes to complete them, especially if the study requires completing them multiple times in a session. Because of this, the time needed to fill out a multi-item questionnaire might lead to a decrease in cybersickness symptoms [248] which can influence the results. Therefore, in addition to multiple-item questionnaires, single-item questions are also commonly used across different studies, although they tend to be more study-specific and less extensive.

In terms of measuring pain, discomfort, physical exertion and fatigue, researchers often use scales developed by Borg [249]. Due to its specific scaling, which ties verbal anchors to values between 6 and 20, the Borg Rating of Perceived Exertion (RPE) scale is able to provide an estimate of the user’s heart rate. The Borg CR10 scale provides a simpler scaling system, with verbal anchors corresponding to values between 0 and 10. A methodology that examines multiple symptom groups within the same study, combining the SSQ with subjective measures of ergonomic symptoms (including the use of the modified Borg CR10 scale) and digital eye strain, is presented in [127]. To gain a deeper insight regarding the effort necessary to interact with the virtual environment, and better understand which dimensions contribute towards greater frustration and fatigue, researchers may employ subjective measures of workload, such as the NASA Task Load Index (NASA-TLX) [250] or the novel Simulation Task Load Index (SIM-TLX) [251], developed with the VR platform in mind.
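The relation between an RPE rating and heart rate can be illustrated with a minimal sketch. It assumes the commonly cited rule of thumb that the rating multiplied by ten roughly approximates heart rate in beats per minute; this rule is not taken from the studies cited above, the function name `estimated_heart_rate` is hypothetical, and the result is only a coarse estimate that varies with age and fitness.

```python
def estimated_heart_rate(rpe: int) -> int:
    """Rough heart-rate estimate (beats per minute) from a Borg RPE rating.

    Assumption: heart rate is approximated as 10 x RPE, a rule of thumb
    rather than a measurement.
    """
    if not 6 <= rpe <= 20:
        raise ValueError("Borg RPE ratings range from 6 to 20")
    return rpe * 10


# Print the estimate for the lowest, a middle, and the highest rating.
for rating in (6, 13, 20):
    print(rating, estimated_heart_rate(rating))
```

Such an estimate may serve as a plausibility check against heart rate recorded by a sensor, not as a replacement for it.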

Objective measures: Kim et al. [252] and Dennison et al. [253] tracked various physiological signals using different modules of a Biopac polygraph: ECG, electrooculogram (EOG), electrogastrogram (EGG), GSR etc. Kim et al. [252] found that gastric tachyarrhythmia, blinking, breathing and heart rate significantly correlated with the cybersickness score. Dennison et al. [253] conducted a study on twenty individuals, examining the impact of virtual reality use on cybersickness. Their findings show that changes in breathing, blinking and stomach activity may serve as indicators of cybersickness. Results by Wu et al. [254] show that impaired response inhibition can indicate cybersickness, which can be assessed by measuring inhibition-related components of event-related potentials (ERPs).

For pre-screening participants and evaluating vision-related symptoms (e.g., [138]), researchers may choose to use eye charts and tools for assessing color perception (e.g., Ishihara test [255]), distance vision (e.g., Snellen chart [256]), near vision (e.g., Fonda-Anderson chart [257]), and stereo vision acuity (e.g., Butterfly Stereo Acuity test), as well as vergence and accommodation (e.g., Royal Air Force near point rule [258]). Iskander et al. [129] mention the potential of VR HMDs equipped with eye-tracking technology, as they highlight the deficit of datasets containing captures of coordinated eye and body movement during immersive VR, which would aid in assessing visual fatigue. Eye-tracking technology enables researchers to collect different types of ocular measures, such as gaze direction, fixation duration, blink duration and frequency, and pupil dilation. In addition to their use in assessing fatigue [36], eye movements may also be used as a measure when exploring the effect of VR on cognitive processing, e.g., by tracking saccadic eye movements using VR-specific tools such as [259]. Moving away from ocular measures, an example of a methodology incorporating various cognitive performance measures (reaction time, mental rotation, visual search, visual working memory) is presented in [135]. Since reaction time tests are a popular choice among tools for assessing cognitive performance, researchers may choose to use tools measuring both simple and choice reaction time, such as the Deary Liewald reaction time task [260], or a tool such as the CANTAB 5-choice reaction time task [261], which provides results for decision and motor movement speed.

7 Temporal aspects of QoE assessment

Fig. 4 Time spans of user experience (adapted from [262, 263])

Figure 4 depicts time spans of user experience, based on models presented in [262] and [263]. Before the user even starts interacting with the system, they form a set of expectations about the experience (an internal reference [264]). For example, these expectations may form as a consequence of the user’s previous experience with a similar system, or they may be a result of the halo effect. As the user begins to interact with the system, they perform a series of momentary evaluations of the experience (comparing actual experience to their internal reference), based on which they are able to form a reflective evaluation of an episode of use. Repeated use of the system allows the user to make judgements over the span of multiple episodes, and impacts their summative evaluation of the system as a whole. An in-depth analysis of temporal development of QoE is given in [264].

7.1 When (and how) to measure momentary and reflective QoE

Subjective ratings of reflective QoE are usually collected post-episode via single- or multiple-item questionnaires. During use, the perceived QoE is continuously changing based on the current (momentary) level of quality, and may even increase or decrease drastically in case of sudden changes. However, as explained in [264], when an episode of use ends, and the user is asked about their experience, they are more likely to report a level of quality that correlates with their initial/first (i.e., the primacy effect) or their more recent (i.e., the recency effect) momentary judgements, which suggests that measuring reflective QoE does not provide an accurate evaluation of momentary experience. Furthermore, if, after encountering an impairment during use, users experience a certain period in which their experience is not impaired, they may be more likely to disregard the impairment as they reflect on the experience, which is known as the forgiveness effect [265]. With this in mind, depending on the IFs examined in the user study, researchers should decide whether it is more appropriate to measure momentary or reflective QoE (or both), and choose suitable measures and methods based on this decision. Researchers may ask the participants to evaluate subjective momentary QoE by assessing the quality of a series of very short (i.e., several seconds) samples which comprise a longer test stimulus, or by continuously reporting the quality of a longer stimulus using a slider or some other type of mechanism that allows for continuous collection of momentary ratings [264]. However, attempting to evaluate QoE in this way means that the user’s attention is continuously being divided between the material they are trying to evaluate and the evaluation task itself [9]. 
In the context of user experience with a medium such as VR, which strongly relies on the sense of “being” in the virtual world, divided attention and/or constant interruptions are likely to diminish the level of presence/immersion experienced by the user [225] and thus significantly affect the overall VR QoE. A less obtrusive approach relies on the use of physiological measures with a high sampling rate [9], such as EEG, GSR, and heart rate.
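The gap between momentary and reflective ratings can be illustrated with a toy calculation. The sketch below is not a model taken from [264]; it simply assumes that a reflective rating is a weighted blend of the first momentary rating (primacy), the last one (recency), and the mean of all ratings. The function name, the weights, and the sample ratings are hypothetical.

```python
from statistics import mean


def weighted_reflective_estimate(ratings, w_first=0.25, w_last=0.5):
    """Toy estimate of a reflective rating from momentary ratings.

    Assumption: the reflective rating blends the first rating (primacy),
    the last rating (recency), and the mean of all ratings. The default
    weights are illustrative, not empirically derived.
    """
    if not ratings:
        raise ValueError("at least one momentary rating is required")
    w_mean = 1.0 - w_first - w_last
    return w_first * ratings[0] + w_last * ratings[-1] + w_mean * mean(ratings)


# Hypothetical momentary ratings on a 1-5 scale: a good start, an impairment
# in the middle of the episode, and a recovery at the end.
momentary = [4, 4, 2, 2, 5]

print("mean of momentary ratings:", mean(momentary))
print("primacy/recency-weighted estimate:", weighted_reflective_estimate(momentary))
```

With these sample values, the weighted estimate is higher than the plain mean, because the impairment occurs in the middle of the episode and is outweighed by the first and last impressions. This mirrors the reason why a post-episode rating alone may hide momentary drops in quality.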

7.2 Considerations regarding the duration of a single test scenario/study session

An important issue with measuring reflective QoE is determining the optimal duration of a test scenario. ITU-T Recommendation P.809 [70], which focuses on subjective evaluation of gaming QoE, describes two testing approaches depending on the aim of the user study. A short interactive test, lasting between 90 and 120 s, should be adequate for assessing more straightforward QoE features (i.e., quality of interaction). Long interactive tests, usually lasting between 10 and 15 min, are more suitable for measuring affective states and evaluating complex features such as immersion, presence, or flow. These recommendations should be taken into consideration, as interactive VR applications (especially VR games) share many similarities with other VEs, such as games played on less immersive platforms. However, researchers should also consider VR-specific issues and health risks when determining the duration of VR exposure for user studies.

Murata and Miyoshi [266] used a posturography technique based on a force platform to measure body sway during VR use. Results obtained under the control condition (i.e., while not using a computer/VR system) showed that postural instability and cybersickness tend to remain stable over the course of three hours. On the contrary, during the three-hour experiment, postural stability of participants immersed in a VR environment gradually decreased, while symptoms of cybersickness increased compared to the pre-immersion condition.

Wang and Suh [125] present a time-varying cybersickness model with trigger factors and adaptation factors, depicted in Fig. 5, and based on [267]. As users begin to experience cybersickness, their body starts to adjust (see Sect. 3.5.3), leading to a decrease in cybersickness. Even though users continue to adapt to cybersickness triggers, the sensation of cybersickness has a tendency to accumulate with prolonged exposure which can eventually lead to an unenjoyable experience, although this process is slowed down when adaptation mechanisms (e.g., adjusting their movements, taking a break) are employed.

Fig. 5 Time-varying cybersickness model with trigger factors and adaptation factors (taken from [125]; based on [267])

Stanney et al. [15] conducted a cybersickness study (n = 1102) in which participants were exposed to a virtual environment for an assigned duration (15, 30, 45 and 60 min). The authors reported a cybersickness rate above 80% and an increase in symptom severity with longer exposure. However, around the 45 min mark nausea- and disorientation-related symptoms stopped increasing, while oculomotor symptoms continued to worsen. Longer exposures produced a greater dropout rate. During the first hour after exposure, total severity of symptoms decreased by 30.7%, but even 2 to 4 h after exposure 73% of participants were still experiencing symptoms, while 35% continued experiencing them more than 4 h after exposure. 18% of participants reported cybersickness symptoms the following morning (approx. 24 h after exposure). While reported symptoms included nausea and oculomotor symptoms, the main type of symptom that remained after a longer duration was disorientation.

In case of multiple test scenarios during a single user session (especially if using long interactive tests), the total duration of VR exposure may significantly exceed the duration of 15 min recommended by Stanney et al. [15], or even 30 min recommended by Drachen et al. [148]. During this time, symptoms of cybersickness may accumulate. This is an example of the effect known as multiple treatment interference, which happens when test subjects are asked to participate in a series of treatment conditions. In such circumstances, an effect caused by a previously experienced condition (e.g., tiredness, expertise) may carry over to the subsequent treatments, potentially influencing the results of the study [139]. In the context of VR QoE studies, multiple treatment interference may happen with factors/features such as physical symptoms (e.g., eye-strain, nausea), ease of use, affective states, as well as task performance measures. To a degree, randomizing the order of test scenarios may mitigate the issue of invalid QoE scores, while using test tasks that are designed with user comfort in mind (if appropriate for the study) may prevent or reduce physical symptoms. Readers interested in temporal factors involved in the experience of cybersickness may refer to [144] for a detailed overview of the topic.
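Randomizing the order of test scenarios can be sketched in a few lines. The example below is a minimal illustration, not a procedure prescribed by the cited works: `shuffled_orders` draws an independent random order for each participant, while `rotated_orders` cyclically rotates the scenario list so that, when the number of participants is a multiple of the number of scenarios, each scenario appears in each position equally often. The function names, scenario labels, and seed are hypothetical.

```python
import random


def shuffled_orders(scenarios, n_participants, seed=0):
    """Independent random scenario order for each participant.

    A fixed seed makes the assignment reproducible across runs.
    """
    rng = random.Random(seed)
    orders = []
    for _ in range(n_participants):
        order = list(scenarios)
        rng.shuffle(order)
        orders.append(order)
    return orders


def rotated_orders(scenarios, n_participants):
    """Cyclic rotation of the scenario list, one rotation per participant.

    Balances the position of each scenario, but not which scenario
    precedes which.
    """
    k = len(scenarios)
    return [[scenarios[(p + i) % k] for i in range(k)]
            for p in range(n_participants)]


# Hypothetical scenario labels.
scenarios = ["A", "B", "C"]

for participant, order in enumerate(rotated_orders(scenarios, 3), start=1):
    print(participant, order)
```

Rotation balances position effects such as accumulated fatigue, whereas independent shuffling also varies which scenario follows which. Neither approach removes carry-over effects entirely, so breaks between scenarios may still be needed.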

7.3 Measuring repetitive and retrospective QoE

Karapanos et al. [268] discuss different approaches to collecting samples of user data in the context of repeated use. The pre-post approach refers to collecting and comparing participant data twice (i.e., at a point in time which is close to the beginning of the study, and again after a certain time period). The longitudinal approach is based on collecting a greater number of measurements. Wilson and McGill [269] highlight the deficit of longitudinal user studies evaluating the use of VR and its consequences. Considering that commercial VR is still in its early stages, there is a lack of knowledge regarding long-term usage and the way it reflects on one’s psychological and physiological health. Aside from health-related issues, examining VR use over a longer period of time is vitally important for gaining a deeper level of insight into user behaviour and preferences, and the way they change over time.
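The difference between the two sampling approaches can be sketched in a few lines of Python. The scores below are invented for illustration, and the least-squares slope is only one of many possible ways to summarize a longitudinal series; it is not a method prescribed by [268].

```python
def pre_post_change(scores):
    """Pre-post approach: compare only the first and the last measurement."""
    return scores[-1] - scores[0]


def linear_trend(scores):
    """Longitudinal approach: least-squares slope over all sessions
    (score units per session), using session index 0, 1, 2, ... as time."""
    n = len(scores)
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


# Hypothetical discomfort scores of one participant across five sessions.
sessions = [40, 30, 35, 20, 10]

if __name__ == "__main__":
    print("Pre-post change:", pre_post_change(sessions))
    print("Trend per session:", linear_trend(sessions))
```

In this invented series, the pre-post comparison captures only the overall drop, whereas the intermediate measurements also reveal the temporary increase in the third session.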

Previous research has shown that the perceived importance of different characteristics of a product (e.g., perceived stimulation [270]) tends to change over time as the novelty wears off. Additionally, Fenko et al. [271] examined the shift in sensory dominance that happens as users spend more time with the product: while vision tends to be the dominant modality in the beginning, the perceived importance of other modalities, such as touch and audition, increases with further usage. Valuable information regarding this issue in the context of VR has been provided by Bailenson and Yee [272], who conducted fifteen sessions over the course of ten weeks, observing task performance, presence, cybersickness, user behaviour, and entitativity in a collaborative virtual environment. As the study progressed, participants spent less time looking at each other, suggesting an increase in reliance on audio communication. This gradual change in behaviour, coupled with the fact that participants experienced a reduction in cybersickness with repeated use, confirms that results acquired in a single VR session are not necessarily reflective of multi-episodic VR use.

Instead of the longitudinal approach, which usually includes collecting data on multiple occasions, during or shortly after every VR use, researchers may ask participants to recall their previous VR experiences and provide their overall assessment of the system as a summative evaluation. Retrospective recall of a single experience, or a collection of experiences, is memory-based, and therefore may diverge from any impressions formed during or immediately after usage. However, while introducing bias, retrospective recall should not be dismissed in the context of evaluating user experience/acceptance, since memories condition future behavior (forming an internal reference) and, if communicated, may influence other users [268].

8 Physical environment in VR research

Even though the goal of every VR experience is to immerse the user in the virtual world (producing what we call the place illusion), the physical environment of the study remains a relevant aspect of study design. In most interactive VR applications, the user’s physical movement translates to movement in a virtual environment (i.e., by moving within the tracked space, the user controls the movement of their avatar). This poses a safety issue, as VR headsets obscure the user’s view of the real world, which can potentially lead to injury and material damage. While the process of path integration (i.e., using proprioceptive cues to monitor spatial positioning) enables spatial updating in the absence of visual cues [273], the focus on traversing the virtual environment, which usually involves some degree of physical motion (turning the head and/or body, walking, etc.), tends to cause disorientation with respect to the user’s position in the real world. In addition to the issue of disorientation, being immersed in a virtual environment can interfere with the perception of egocentric distance [274], leading to mistargeted movement that may result in dangerous collisions. Fortunately, the issue of incorrect egocentric distance perception has been greatly reduced in newer VR systems, such as HTC Vive [275] and Oculus Rift [276]. Nevertheless, in order to counteract these threats to participant safety, participants should be supervised at all times, and studies should be conducted in a spacious, uncluttered environment. Certain environmental conditions, such as hot temperatures or high humidity, may increase the likelihood of cybersickness [277]. Thus, it is advisable to keep the space well-ventilated, and to provide water and snacks [116], as well as a comfortable place for participants to sit or lie down in case they experience the onset of cybersickness symptoms.

Stepping away from the issue of participant comfort and safety, the location and the overall context of the experiment pose a significant influence on decisions regarding methodology, as well as on the overall outcome of the study and its internal/external validity. Due to the inherent characteristics of the environment, and accompanying contextual variables, the study performed in a laboratory (i.e., an environment that is specifically intended for scientific research) may greatly differ from a field study (defined as “research conducted in a place that the participant or subject perceives as a natural environment” [139]).

8.1 In a laboratory setting

Conducting a user study in a laboratory is a very common practice in VR research, which is unsurprising given its many benefits. Designated laboratories adapted for VR testing are usually spacious and supplied with advanced VR equipment that would otherwise be difficult to transport and set up, especially if it includes large, complex devices such as a VR treadmill or exercise equipment. Conducting the study in a specialized enclosed space gives researchers more control over factors such as temperature, humidity, and the allowed number of people, which creates a higher level of comfort (both physical and psychological) compared to a public setting, while the presence of an administrator serves as an additional safety measure compared to non-supervised studies, such as those conducted in participants’ living spaces. The most obvious benefit of a laboratory environment, however, is the increased internal validity of the study, which is a result of controlled environmental variables. This control comes at a cost: evaluating the application in such a sterile, artificial environment negatively impacts the external validity of the study, as acquired results may not be representative of real-world usage [139].

8.2 In “the wild”

Choosing to conduct the study outside of a laboratory requires changes in methodology and duration. These changes can go both ways: compared to laboratory studies, methodology may be more limited in the case of public walk-in studies, or more extensive if subjects are able to participate from the comfort of their homes. Likewise, the duration of field studies varies greatly. For example, a study conducted at a public place/event may have to be shortened to only a few minutes (e.g., [278]), while moving the study to a home setting may even allow for longitudinal research (e.g., [279]).

8.2.1 At a public place/event

Conducting a VR user study at a public place (e.g., amusement park, shopping mall) and/or during a public event (e.g., exhibition, convention) is a convenient way to assess short-term QoE/UX for a large number of participants. Careful selection of the venue/event can be used to facilitate access to the target demographic (e.g., researchers may choose to conduct a gaming QoE/UX study at a gaming convention visited by a large number of avid gamers) without a tedious pre-screening process.

However, this type of setting has its fair share of obstacles in terms of methodology design. Firstly, in such cases, the duration of the study process is generally kept at a bare minimum (e.g., 2–5 min [278]). Due to such brief exposure to VR, participants are not likely to experience more complex aspects of VR experience (e.g., immersion [70], cybersickness) to their fullest degree. Thus, when conducting a study in such a public scenario, researchers should be aware of the limitations imposed on the choice of observed factors/features, and their implications on the validity of subsequent results. Wearable devices, whether used for position tracking or collecting physiological signals, are generally too cumbersome for the fast subject turnover of a public walk-in study. For example, there may not be enough time to calibrate devices for individual use, or acquire baseline measurements of physiological signals; e.g., each analysis interval during which continuous EEG data is being collected should be around 5–10 min long, following a 2–5 min period for the collection of baseline measurements [280]. Therefore, researchers may choose to use questionnaires [281], or rely only on behavioral methods [278]. Moreover, using a VR application in a public setting could cause certain users to feel uncomfortable, exposed, or pressured, as some people consider the public use of VR to be embarrassing [152, 282, 283], which may influence their subjective assessment, or cause them to interact with the application differently than they would if they were using it in a more private situation. The influence of being watched whilst immersed in VR is analyzed in detail in a paper by Mai et al. [283]. The authors also elaborate on other possible issues with the public use of VR, such as unwanted touches and the increased likelihood of injury in case of collision with a bystander. Based on these observations, the authors present valuable findings and suggestions on the use of spatial, visual and auditory separation between the person immersed in a VR experience and other people, the inclusion of a supervising person to watch over the user and help them feel more comfortable, and scenario/methodology design that allows the user to slowly ease into the VR experience without feeling too self-conscious. Additional guidelines on how to provide a more comfortable experience for participants using VR in public are presented in [284].
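The timing constraint on physiological measurements mentioned above can be made explicit with a short arithmetic sketch. The durations are the ranges quoted in the text ([278, 280]); the constant and function names are illustrative only.

```python
# Durations in minutes, taken from the ranges quoted in the text.
EEG_BASELINE_MIN = (2, 5)    # baseline recording period [280]
EEG_ANALYSIS_MIN = (5, 10)   # one continuous analysis interval [280]
WALK_IN_SLOT_MIN = (2, 5)    # typical public walk-in study duration [278]


def shortest_eeg_protocol(baseline=EEG_BASELINE_MIN, analysis=EEG_ANALYSIS_MIN):
    """Lower bound on recording time: shortest baseline plus shortest interval."""
    return baseline[0] + analysis[0]


def fits_in_slot(required_min, slot=WALK_IN_SLOT_MIN):
    """Check whether the required time fits into the longest available slot."""
    return required_min <= slot[1]


if __name__ == "__main__":
    needed = shortest_eeg_protocol()
    print("Shortest EEG protocol:", needed, "min")
    print("Fits a walk-in slot:", fits_in_slot(needed))
```

Even under the most favorable assumptions (shortest baseline, shortest analysis interval, longest walk-in slot, and no calibration time), the recording alone would not fit into a typical walk-in slot.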

8.2.2 At the target location

Depending on the intended goal, a VR application may be developed for personal use, or as an education/visualization tool. By moving the study to the target location (i.e., conducting a field study) such as a school, or a living space, researchers can avoid the negative impact of an artificial laboratory setting on the external validity of the study, thereby achieving a higher level of experimental realism [139]. Using the examined system, service or application in an environment that is perceived as more natural mitigates the issue of participant reactivity (i.e., display of modified participant behaviours resulting from the awareness that they are being observed/tested [139]). However, while improving external validity, moving away from the sterile laboratory environment tends to decrease internal validity, as it becomes harder to control environmental variables of the study [139]. Additionally, based on our experience in conducting VR user studies, we would argue that providing a large number of participants/locations with expensive VR equipment, lending devices for a longer period of time, or transporting complicated VR setups to a target location may be problematic for the institution conducting the study, which greatly impacts the scope of the field study in terms of used devices, population size, study duration, observed IFs, etc. The same availability issues hold true for the use of devices for physiological data collection. Thus, researchers may have to rely solely on self-reported data.

A large percentage of VR applications are intended for personal use in a private space, but the option of conducting VR studies from the users’ homes is especially important to consider in the context of the COVID-19 pandemic, as researchers struggle with limited access to laboratories and public spaces, and with hygiene concerns regarding shared VR HMD use. While evaluating the use of VR in home conditions is slowly becoming more achievable, as the number of casual VR users has started to increase over the last several years [269], VR owners are still a definite minority. When conducting from-home studies, the majority of participants will have to be provided with the necessary equipment, which, as discussed, tends to be highly impractical and/or financially straining for research institutions conducting the experiment, especially in terms of more advanced VR systems. Depending on the goal of the study, a more achievable solution may be to focus on mobile VR, which is less expensive, standalone (i.e., does not rely on a VR-ready computer), and easy to set up. However, a more promising solution for this issue may be found in the use of crowdsourcing for QoE assessment [143], leveraging Internet platforms for the recruitment of VR owners for participation in online studies.

For instance, Steed et al. [179] conducted a field study exploring presence and embodiment in immersive VR using mobile VR platforms. With respect to limitations of popular mobile VR devices, the content was designed to be non-interactive as a way to mitigate the chance of injury and adverse reactions, as this was a public, non-supervised study. The application was distributed via app stores. While this type of distribution made it available to a large number of customers, data collection (answers to two questionnaires, device information, head-tracking information) was performed only in case of consenting users. The benefit of this method of gathering participants is the broadening of the population set in comparison to a typical laboratory study. However, the authors note that the issue of test material design is more relevant in case it is being distributed in such a public way (compared to laboratory studies), considering that it has to rely on visual attractiveness and content quality in order to stand out from other VR applications, while avoiding elements that may provoke a stressful response due to ethical reasons. Recent examples of the use of crowdsourcing in VR user research are presented in [127, 285]. However, in cases where an official supervisor is not present to monitor the use of an application, participants should still be monitored (e.g., by a family member, friend, or colleague) to prevent injury.

9 Summary of key challenges

9.1 Identifying influence factors and features to be used for assessing and modeling QoE

As listed in Sect. 3, there are multiple factors responsible for the formation of VR user experience. Some of those factors are relevant to other types of audio-visual services, even non-interactive ones, while others are specific to immersive interactive VR. By identifying key factors and examining their influence on different QoE features, researchers are able to make adjustments to their study design, and collect data to be used for QoE modeling. Careful consideration of the VR market, especially in light of recent findings pertaining to VR user acceptance, helps with narrowing the focus towards the most relevant aspects of the VR experience. Recent technological advancements regarding commercial VR technology, as well as the arrival of 5G networks, highlight the need for further research observing the influence of system IFs on the overall QoE.

In addition to its immersivity, VR use is characterized by higher levels of discomfort compared to less intrusive platforms. Researchers should consider examining VR-induced discomfort as an important feature to incorporate into UX/QoE and technology acceptance models. Additionally, while the existing body of work addressing cybersickness is relatively large, recent findings call for a shift towards exploring other symptoms of discomfort (i.e., digital eye strain, ergonomic factors related to headset design, control modality, and interaction) and fatigue, as well as cognitive performance aftereffects of VR use.

9.2 Defining the test methodology

While it is unethical to expose participants to situations that may bring significant or long-term psychological or physiological harm, some researchers argue that it is necessary to include more sensitive populations in VR research, as it provides valuable information which can be used to adapt existing systems to their specific needs. Therefore, there is a need for guidelines regarding pre-screening methods and exclusion criteria, with respect to ethical issues. Considering university students are generally over-represented in user research, efforts should be made towards broadening the participant population in future studies. However, this calls for additional research and methodology guidelines pertaining to more sensitive demographics, such as children and the elderly. Additionally, researchers should focus on including participants with various levels of experience, and work towards determining the impact of the novelty effect on the overall QoE.

Standardized test material facilitates comparison between studies, as well as the reproduction of study results. Therefore, efforts should be invested towards designing test applications which include relevant interaction methods, and are suitable for use with different types of I/O devices.

Subjective methods are commonly used in VR user studies. While most researchers use paper questionnaires, incorporating questionnaires into VEs used for testing should be encouraged. Even though they offer valuable information, subjective methods should be combined with objective methods for more relevant results, and their mutual relationship should be examined. Although a significant number of questionnaires are in common use (mostly related to cybersickness and presence), there is still room for improvement with regard to addressing specific dimensions of VR use. In terms of objective measures, new technology (e.g., eye-tracking in VR HMDs) facilitates the development of novel methods for assessing user experience.

Considering VR is generally more physically exhausting and more likely to induce cybersickness compared to most other platforms, the recommended duration of each episode of use is up for debate, with certain sources recommending time frames as low as 15 min. Unfortunately, performing a user study while limiting VR exposure to such a short duration also limits the validity of its results, as they may not be representative of realistic, long-term use. Therefore, there is a need for guidelines addressing study duration. Additionally, there is currently a deficit of studies exploring the effect of prolonged VR exposure, as well as a deficit of longitudinal VR studies. On a similar note, the majority of QoE studies are conducted in a sterile laboratory environment. Along with extending the observed time frame, conducting research in a more realistic setting is likely to result in valuable insights and greater external validity. Recent events and regulations related to the COVID-19 pandemic highlight the importance of considering the use of crowdsourcing to facilitate VR user research.

10 Conclusion

In this paper we have provided an overview of perception-based QoE assessment for interactive VR applications, organized into sections discussing the motivation behind VR user research, relevant IFs and QoE features, pre-screening and participant choice, test material, subjective and objective measures, as well as study duration and preparation of an appropriate study environment. Guided by the multimodality of the VR platform, along with its wide array of potential uses, we have based this overview on sources stemming from various branches of science. Bringing together key findings from literature and existing standards, we have presented a collection of resources, explanations, and recommendations to serve as a reference for academic and industry researchers interested in conducting VR user studies. Based on our findings, we have summarized key challenges to be addressed in future research: identifying IFs and features to be used for QoE modeling, as well as addressing different ethical and practical aspects of methodology design for VR user research. We note, however, that each of the presented elements of perception-based QoE assessment requires its own in-depth review, as the aim of this paper was to provide only a concise, high-level overview of the topic.