
Computers & Graphics

Volume 95, April 2021, Pages 141-155

Technical Section
An augmented crowd simulation system using automatic determination of navigable areas

https://doi.org/10.1016/j.cag.2021.01.012

Highlights

  • We propose an augmented crowd simulation system using automatic determination and reconstruction of navigable areas in static, surveillance-like videos.

  • We utilize pedestrian trajectory data and use deep learning-based semantic segmentation methods to identify navigable areas.

  • We simulate artificial agents over the reconstructed navigable area together with real agents in the video via collision avoidance.

  • We demonstrate the accuracy and applicability of the proposed navigable area reconstruction approach on various crowded outdoor scenarios.

Abstract

Crowd simulations imitate the group dynamics of individuals in different environments. Applications in entertainment, security, and education require augmenting simulated crowds into videos of real people. In such cases, virtual agents should realistically interact with the environment and the people in the video. One component of this augmentation task is determining the navigable regions in the video. In this work, we utilize semantic segmentation and pedestrian detection to automatically locate and reconstruct the navigable regions of surveillance-like videos. We place the resulting flat mesh into our 3D crowd simulation environment to integrate virtual agents that navigate inside the video avoiding collision with real pedestrians and other virtual agents. We report the performance of our open-source system using real-life surveillance videos, based on the accuracy of the automatically determined navigable regions and camera configuration. We show that our system generates accurate navigable regions for realistic augmented crowd simulations.

Introduction

Crowd simulations investigate the interaction of individuals inside and among groups of people, in terms of behavior, appearance, personality, and emotions. Models used in such simulations aim for a realistic interaction with the environment; hence, the appearance and behavior of the virtual agents that represent individuals should fit the context of the scene for better immersion. Quantitative methods assess the realism of such simulations, comparing the simulated crowd with real-world data.

Augmenting virtual crowds into real-life videos has applications in entertainment, security, and education. Virtual crowds can cost-effectively fill environments in movies, appearing together with real actors; virtual tutors can move inside live environments to create immersion in training applications. In such augmented crowd simulations, virtual agents should be indistinguishable from the real people and should interact with the real crowd and the environment realistically. This requires careful inspection of the environment and the individuals in the video.

Augmented crowd simulations benefit from data-driven approaches for pedestrian and scene inference. Using a model of the environment and pedestrian trajectories, a virtual crowd can be augmented into the input video so that virtual agents move plausibly in the scene without colliding with each other or with the real pedestrians. However, many steps of such a workflow require labor-intensive manual processing, including the construction of an environment model for the virtual crowd.

We introduce our open-source augmented crowd simulation system that utilizes an automated approach for the determination and reconstruction of navigable regions in real-life surveillance-like videos. We make use of existing methods of semantic segmentation and pedestrian tracking to determine image-level navigable regions. Then we reconstruct the aerial view of these regions as a flat mesh and position it in our 3D crowd simulation environment. From the perspective of the automatically calibrated scene camera, the virtual agents move inside the navigable regions of the video, avoiding scene obstacles, real pedestrians, and other virtual agents. We evaluate the accuracy of the generated navigable regions in comparison to the ground truth, using real-life surveillance videos.
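Geometrically, the augmentation reduces to standard pinhole projection: once the scene camera is calibrated, every point on the reconstructed ground plane maps to a unique pixel, so an agent walking on the flat mesh appears to walk inside the video. The following minimal sketch illustrates the idea; the intrinsics and extrinsics are illustrative values, not calibration results from the paper.

```python
import numpy as np

def project_ground_point(K, R, t, p_world):
    """Project a 3D world point on the ground plane into the image.

    K: 3x3 intrinsics; R, t: extrinsics of the calibrated scene camera.
    All concrete values below are hypothetical.
    """
    p_cam = R @ p_world + t   # world -> camera coordinates
    uvw = K @ p_cam           # camera -> homogeneous pixel coordinates
    return uvw[:2] / uvw[2]   # perspective divide

# An agent standing 2 m right of and 3 m in front of the world origin (y = 0).
K = np.array([[800.0, 0.0, 640.0],
              [0.0, 800.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, -1.6, 5.0])  # camera mounted 1.6 m above the ground plane
print(project_ground_point(K, R, t, np.array([2.0, 0.0, 3.0])))  # ~[840, 200]
```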

We list our contributions as follows:

  • Automatic determination and reconstruction of image-level navigable areas in surveillance-like videos for seamless integration of virtual agents.

  • Evaluation of the resulting image-level navigable areas using different combinations of segmentation networks and training sets.

Data-driven crowd simulations

Many applications of crowd simulations utilize real-life data for realistic agent behavior. Musse et al. [1], Lerner et al. [2], and Kim et al. [3] extract pedestrian trajectories from real-life video sequences to simulate movements of virtual agents in various crowd scenarios. Jablonski et al. [4] evaluate the accuracy and the realism of crowd simulations in comparison to real-life footage, using pedestrian flow. Amirian et al. [5], [6] generate crowd trajectories that mimic the behavior of real pedestrians.

Framework

Our open-source framework, outlined in Fig. 1, provides an augmented interactive crowd simulation in Unity [29]. We simulate virtual agents walking in navigable regions of the input video while avoiding collision with real pedestrians. To reconstruct the navigable scene, we preprocess the input video using computer vision techniques from the OpenCV library [30]. The crowd simulation runs in real time, and the preprocessing is performed off-line.

The input of our system is the video of an outdoor scene recorded by a static, surveillance-like camera.
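As a rough illustration of this off-line pass, the sketch below samples frames, unions per-frame segmentation masks, and lets tracked pedestrian positions vote for navigability. The helpers `segment_navigable` and `track_pedestrians` are hypothetical placeholders for the segmentation networks and pedestrian trackers the system evaluates, not functions from the actual code base.

```python
import cv2

# Hypothetical stand-ins for the segmentation network and pedestrian tracker;
# these names are not from the paper's code base.
from mysystem import segment_navigable, track_pedestrians

def preprocess(video_path, sample_every=30):
    """Off-line pass: accumulate an image-level navigable mask."""
    cap = cv2.VideoCapture(video_path)
    mask = None
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            seg = segment_navigable(frame)      # binary walkable-surface mask
            mask = seg if mask is None else cv2.bitwise_or(mask, seg)
            for (x, y) in track_pedestrians(frame):  # feet positions vote too
                cv2.circle(mask, (int(x), int(y)), 5, 255, -1)
        idx += 1
    cap.release()
    return mask
```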

Navigable area reconstruction

We infer the navigable regions of the scene and generate a 2D navigation mesh from the union of these regions in an aerial view. We place the 2D mesh in our simulation environment using the camera configuration of the video, so that virtual agents walking on the navigation mesh appear to walk on the navigable regions of the video. This reconstruction process involves the following steps:

  • 1.

    We analyze the video frames to determine the navigable regions using semantic segmentation and pedestrian tracking; a sketch of the subsequent aerial-view warp follows below.
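The aerial-view step amounts to rectifying the image-level mask with the image-to-ground homography recovered from the camera calibration. A minimal OpenCV sketch, where the homography `H` is assumed to be given (the morphology kernel and approximation tolerance are illustrative choices, not the paper's parameters):

```python
import cv2
import numpy as np

def to_aerial(navigable_mask, H, out_size=(512, 512)):
    """Warp the image-level navigable mask into a top-down (aerial) view.

    H is the image-to-ground homography recovered from the camera
    calibration; the kernel size and tolerance below are illustrative.
    """
    top = cv2.warpPerspective(navigable_mask, H, out_size)
    # Close small holes left by segmentation noise and pedestrian votes.
    top = cv2.morphologyEx(top, cv2.MORPH_CLOSE, np.ones((7, 7), np.uint8))
    # Boundary polygons of the walkable area; these seed the 2D mesh.
    contours, _ = cv2.findContours(top, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return top, [cv2.approxPolyDP(c, 2.0, True) for c in contours]
```

The resulting polygons delimit the aerial-view walkable area, which can then be triangulated into the flat navigation mesh placed in the 3D scene.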

Evaluation

We test our framework on various videos captured by stationary, surveillance-like cameras, including PETS09-S2L1 [44], Town Centre [62], MOT16-04 [31], and a custom video. Fig. 8 includes the horizon, the extracted navigable areas, their placement into the 3D scene with dummy agents, and the final output with virtual agents for each test video.

Table 1 shows a quantitative comparison of different pedestrian trackers for PETS09-S2L1 and our custom video. Recall is the percentage of annotated pedestrians in the video that the tracker correctly identifies.
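For reference, recall reduces to a one-line computation once detections are matched to annotations; the counts below are made-up numbers for illustration only.

```python
def recall(true_positives, ground_truth_total):
    """Fraction of annotated pedestrians that the tracker identifies."""
    return true_positives / ground_truth_total if ground_truth_total else 0.0

# e.g., a tracker matching 4210 of 4650 annotated pedestrian instances
print(f"recall = {recall(4210, 4650):.3f}")  # illustrative counts only
```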

Conclusion

We introduce an open-source augmented crowd simulation system that utilizes automatic determination of navigable regions in surveillance-like videos. The GitHub project, which includes the repositories containing the source code of the proposed system, is available at https://github.com/users/YalimD/projects/2. We combine existing techniques of semantic segmentation and pedestrian detection for accurate determination of the navigable regions. We compare our results with the ground truth and manually determined camera configurations.

Declaration of Competing Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors are grateful to Lori Russell-Dağ and İpek Sözen for proofreading the manuscript.

References

  • F. Zheng et al.

    ARCrowd-a tangible interface for interactive crowd simulation

    Proceedings of the 16th international conference on intelligent user interfaces. IUI ’11

    (2011)
  • P. Baiget et al.

    Generation of augmented video sequences combining behavioral animation and multi-object tracking

    Comput Anim Virtual Worlds

    (2009)
  • J.I. Rivalcoba et al.

    Coupling camera-tracked humans with a simulated virtual crowd

    Proceedings of the 9th international conference on computer graphics theory and applications. GRAPP ’14

    (2014)
  • Y. Zhang et al.

    Online inserting virtual characters into dynamic video scenes

    Comput Anim Virtual Worlds

    (2011)
  • Y. Doğan et al.

    Augmentation of virtual agents in real crowd videos

    Signal Image Video Process

    (2019)
  • B. Li et al.

    Vanishing point detection using cascaded 1D Hough Transform from single images

    Pattern Recognit Lett

    (2012)
  • M. Zhai et al.

    Detecting vanishing points using global image context in a non-Manhattan world

    Proceedings of the IEEE conference on computer vision and pattern recognition. CVPR ’16

    (2016)
  • T. Trocoli et al.

    Using the scene to calibrate the camera

    Proceedings of the 29th SIBGRAPI conference on graphics, patterns and images. SIBGRAPI ’16

    (2016)
  • J. Liu et al.

    Surveillance camera autocalibration based on pedestrian height distributions

    British machine vision conference

    (2011)
  • G.M. Brouwers et al.

    Automatic calibration of stationary surveillance cameras in the wild

    European conference on computer vision. ECCV ’16

    (2016)
  • D. Liebowitz et al.

    Metric rectification for perspective images of planes

    Proceedings of the IEEE computer society conference on computer vision and pattern recognition. CVPR ’98

    (1998)
  • D. Liebowitz et al.

    Creating architectural models from images

    Comput Graph Forum

    (1999)
  • B. Bose et al.

    Ground plane rectification by tracking moving objects

    Proceedings of the joint IEEE International workshop on visual surveillance and performance evaluation of tracking and surveillance

    (2003)
  • K. Chaudhury et al.

    Auto-rectification of user photos

    IEEE international conference on image processing. ICIP ’14

    (2014)
  • A. Bulbul et al.

    Populating virtual cities using social media

    Comput Anim Virtual Worlds

    (2017)
  • S. Iizuka et al.

    Efficiently modeling 3D scenes from a single image

    IEEE Comput Graph Appl

    (2012)
  • G. Zhang et al.

    As-consistent-as-possible compositing of virtual objects and video sequences

    Comput Anim Virtual Worlds

    (2006)
  • D. Hoiem et al.

    Automatic photo pop-up

    ACM Trans Graph

    (2005)
  • A. Saxena et al.

    Make3D: learning 3D scene structure from a single still image

    IEEE Trans Pattern Anal Mach Intell

    (2009)
  • Unity Technologies. Unity. Accessed 07 Sep. 2020. Available at...
  • OpenCV team. OpenCV (open source computer vision library). Accessed 07 Sep. 2020. Available at...
  • A. Milan et al.

    MOT16: a benchmark for multi-object tracking

    CoRR

    (2016)
This paper was recommended for publication by Stefanie Zollmann.