Deep-STaR: Classification of image time series based on spatio-temporal representations

https://doi.org/10.1016/j.cviu.2021.103221

Highlights

  • Classification of image time series.

  • Spatio-temporal planar representation of 2D+t image data.

  • Learning spatio-temporal features with 2D convolutional neural networks.

  • Visual post-hoc attention for spatio-temporal result interpretation.

Abstract

Image time series (ITS) represent complex 3D (2D+t in practice) data that are now produced daily in various domains, from medical imaging to remote sensing. They contain rich spatio-temporal information allowing the observation of the evolution of a sensed scene over time. In this work, we focus on the classification of ITS, a task frequently encountered in remote sensing. An underlying problem is to consider jointly the spatial and the temporal dimensions of the data. We present Deep-STaR, a method to learn such features from ITS data in order to classify them. Instead of reasoning in the original 2D+t space, we investigate novel 2D planar data representations containing both temporal and spatial information. Such representations are a novel way to structure the ITS, compatible with deep learning architectures. They are used to feed a convolutional neural network that learns spatio-temporal features with 2D convolutions, ultimately leading to a classification decision. To enhance the explainability of the results, we also propose a post-hoc attention mechanism, enabled by this new approach, that provides a semantic map giving insights into the decision taken. Deep-STaR is evaluated on a remote sensing application, the classification of agricultural crops from satellite ITS. The results highlight the benefit of this method compared to the literature, and its value in easing the interpretation of ITS to understand spatio-temporal phenomena.

Introduction

The multiplicity of sensors, coupled with society's appetite (e.g., industrial, scientific, leisure) for image content, leads to the production of masses of visual data that have to be processed, analyzed, and understood automatically for indexing or classification purposes. In some cases, these visual data are 3D (2D+t in practice), when the sensors produce images of a scene at different times.

Such data sources are varied and many applications could benefit from them. In remote sensing, optical satellite sensors image certain regions every week; these data are used for environmental studies or land-cover mapping. For example, the Sentinel-2 Earth Observation satellite constellation provides image sequences over the same geographical area with high spatial, spectral and temporal resolutions around the globe (Drusch et al., 2012). In medicine, radiology imaging devices are used to follow, month after month, the evolution of a pathology in a patient for longitudinal studies (Madhyastha et al., 2018). In biology, a camera fixed on a microscope can be employed to analyze cell development (Stuurman and Vale, 2016), etc.

The produced 2D+t data carry rich spatial and temporal information that must be taken into account to understand phenomena that are not observable from a single image of the sequence (e.g., seasonal vegetation development from satellite images, tumor remission in medicine) (Ren et al., 2009, Sumpter and Bulpitt, 2000, Weng et al., 2019).

Whether considering a stack of images or a video, we will denote these 2D+t data as Image Time Series (ITS) in the following. An ITS is basically a set of images of the same scene, ordered chronologically. It can be encoded as a data-cube with two spatial dimensions and one temporal dimension. An ITS can be acquired with one or multiple sensors, the latter yielding a longer data series with a high temporal frequency.
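As a concrete illustration (our own sketch, not taken from the paper), such a data-cube can be encoded as a simple multidimensional array; all shapes below are hypothetical:

```python
import numpy as np

# A minimal sketch of an ITS encoded as a data-cube: one temporal axis,
# two spatial axes, plus spectral bands. All shapes here are illustrative.
T, H, W, C = 12, 64, 64, 4              # 12 dates, 64x64 pixels, 4 bands
its = np.random.rand(T, H, W, C)        # stand-in for real acquisitions

# Chronological ordering is carried by the first axis.
first_image, last_image = its[0], its[-1]
print(its.shape)                        # (12, 64, 64, 4)
```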

In this work, we consider a classification task where, given an ITS representing a scene or a particular object, a class label, potentially linked to an evolution over time, has to be predicted. In addition, depending on the scene, moving objects or deformable content may be represented.

The analysis of an ITS generally requires the extraction, from image pixels, of visual features that are as discriminative as possible. In the literature, some approaches focus mainly on the temporal aspect: they consider an ITS as a set of independent pixels, each characterized by its time series (i.e., a 1D temporal signal) and classified individually. In a supervised classification scheme, this has the advantage of providing many learning examples to train a model, but the spatial aspect of the data is totally ignored. Nevertheless, in various applications, this aspect is necessary to discriminate certain complex classes. The joint study of the spatial and temporal domains may allow a finer analysis and a better understanding of the phenomena that characterize the studied objects of interest and their evolution. In this context, some approaches combine spatial and temporal features; often, the two domains are processed independently and a fusion is operated for the decision. There are also approaches that directly take into account spatio-temporal features computed from the data-cube, e.g., convolutional features obtained from a 3D Deep Neural Network (DNN). Such features are then natively spatio-temporal, but training such models is expensive.
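To make the pixel-wise point of view concrete, the following sketch (an illustration under assumed shapes, not the authors' code) reshapes the data-cube into independent 1D temporal samples, which is exactly where the spatial information is lost:

```python
import numpy as np

T, H, W, C = 12, 64, 64, 4
its = np.random.rand(T, H, W, C)        # illustrative data-cube

# Pixel-wise view: each of the H*W pixels becomes one training sample,
# a temporal profile of length T (per band); spatial context is discarded.
pixel_series = its.transpose(1, 2, 0, 3).reshape(H * W, T, C)
print(pixel_series.shape)               # (4096, 12, 4)
```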

The problem studied in this article is the extraction of spatio-temporal features from ITS and their involvement in a deep classification procedure. Our methodological contribution is twofold:

  • we propose Deep-STaR, a method dedicated to ITS classification. We investigate novel planar representations of the 2D+t ITS data, involving both temporal and spatial information. The original 2D spatial dimension of the ITS is embedded into a 1D structure that attempts to preserve the spatial configuration of the pixels. This 1D spatial structure is coupled with the original 1D temporal domain of the ITS in a 2D (planar) spatio-temporal representation, leading to a novel way to structure the ITS that eases both computation and interpretation (see the sketch after this list). This new representation is used to feed a Convolutional Neural Network (CNN) that learns spatio-temporal features, ultimately leading to classification decisions;

  • we investigate an attention mechanism, integrated into our system, that provides a semantic map explaining the decision. The main originality is to embed the attention information back into the original ITS dimensions. This constitutes an added value with respect to the state of the art, since attention has mainly been studied in the spatial or temporal domains separately.
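Since Butz (1971) on Hilbert's space-filling curve appears in the references, a plausible choice for the 1D spatial embedding is a Hilbert ordering of the pixels. The sketch below is our own hedged illustration of the idea, not the authors' implementation: it flattens each image along a Hilbert path and stacks the resulting 1D signals over time into a 2D (space × time) array that a standard 2D CNN can consume.

```python
import numpy as np

def hilbert_d2xy(order, d):
    """Map a distance d along a Hilbert curve covering a 2**order x 2**order
    grid to (x, y) coordinates (classic distance-to-coordinates algorithm)."""
    x = y = 0
    t = d
    s = 1
    side = 2 ** order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                     # rotate the quadrant if needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

T, H, W = 12, 16, 16                    # illustrative; H = W = 2**k needed here
its = np.random.rand(T, H, W)           # single-band ITS for simplicity

order = int(np.log2(H))
path = [hilbert_d2xy(order, d) for d in range(H * W)]

# Planar representation: one row per date, columns follow the 1D spatial
# ordering, so 2D convolutions see space and time jointly.
planar = np.stack([[its[t, y, x] for (x, y) in path] for t in range(T)])
print(planar.shape)                     # (12, 256): time x space
```

A space-filling curve is attractive here because pixels that are close in the 2D image tend to remain close in the 1D ordering, which is what lets 2D convolutions on the planar array still capture spatial locality.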

The remainder of this article is organized as follows. Section 2 introduces related works. In Section 3, we present the Deep-STaR method: firstly, the proposed 2D spatio-temporal representations; secondly, the proposed attention mechanism. Sections 4 and 5 present an experimental study in remote sensing and a discussion of the results, coupled with a comparative study. Finally, conclusions and perspectives are given in Section 6.

Section snippets

Related works for ITS analysis

Numerous approaches exist in the literature for ITS analysis, depending on the task and the application field (e.g., remote sensing, medical imaging, video analysis). We focus here on the features and the adopted point of view (i.e., dimension). We distinguish three groups of approaches, presented hereinafter: (1) those treating ITS as a set of pixel time series, (2) those integrating spatial information in the analysis, and (3) those exploiting more directly spatio-temporal features.

Deep-STaR: ITS analysis from spatio-temporal representations

This section presents the methodological foundations of the proposed Deep-STaR method, which predicts a semantic (class) label from an input 2D+t image time series. Fig. 1 illustrates the workflow of Deep-STaR.

The ITS can be either a (rectangular) patch representing a complete scene (see left of Fig. 1), or only a region of interest (ROI), a connected set of pixels in the image domain (see Fig. 7). We assume that all pixels of the patch/ROI share the same label. In the following, we will use
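As an illustration of how a post-hoc attention map computed on the planar representation could be re-embedded into the original ITS geometry, here is our hedged sketch (Grad-CAM++ is cited in the references, so a Grad-CAM-style map is a plausible source; the function and its signature are our own):

```python
import numpy as np

def attention_to_cube(attn, path, H, W):
    """Re-embed a (T, H*W) planar attention map into the 2D+t ITS geometry.

    `attn` is assumed to come from a Grad-CAM-style method applied to the
    planar representation; `path` is the 1D spatial ordering (e.g., the
    Hilbert path of the previous sketch) used to build that representation.
    """
    T = attn.shape[0]
    cube = np.zeros((T, H, W))
    for j, (x, y) in enumerate(path):
        cube[:, y, x] = attn[:, j]      # undo the space-filling flattening
    return cube                          # one spatial heat map per date

# Hypothetical usage:
# heatmaps = attention_to_cube(attn, path, H=16, W=16)   # shape (T, H, W)
```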

Experimental study in remote sensing

Deep-STaR is experimented on a remote sensing application. Recently, new Earth Observation satellite constellations have been sensing masses of satellite image time series (SITS). The Sentinel-2 constellation provides image sequences over a geographical area with high spatial, spectral and temporal resolutions. Such 2D+t imaging data are useful for agricultural and environmental policy makers, since they enable, for example, the control of agricultural crop fields at large scale to check the farmers' annual declarations.

Results and discussion

We discuss here the classification results on the remote sensing application, obtained with the local MS-STR and global G-STR approaches, and present comparisons with selected competitive methods from the state of the art. We finally provide visual results obtained with the attention mechanism.

Conclusion

In this work, we have proposed the Deep-STaR method, designed for image time series classification. Thanks to a remodeling of the image time series into a planar spatio-temporal representation, the spatial relationships between pixels are partially preserved without losing the temporal information, and native spatio-temporal features are learned while training a classical 2D CNN. The use of a 2D CNN allows benefiting from pre-learned weights, extracted from ImageNet and fine-tuned with specific data. Two
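The point about reusing ImageNet weights can be sketched as follows; the backbone (ResNet-18), input size, and hyper-parameters are assumptions for illustration, not the paper's exact setup:

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10                                   # e.g., crop classes (assumed)

# 2D CNN pre-trained on ImageNet, with a new classification head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Planar representations resized/replicated to the 3-channel input the
# backbone expects; random tensors stand in for real data here.
batch = torch.rand(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))

model.train()
optimizer.zero_grad()
loss = criterion(model(batch), labels)
loss.backward()
optimizer.step()
```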

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The French ANR supported this work under Grant ANR-17-CE23-0015.

References (52)

  • Bagnall, A., et al. The great time series classification bake off: A review and experimental evaluation of recent algorithmic advances. Data Min. Knowl. Discov. (2017)
  • Bailly, A., et al. Dense bag-of-temporal-SIFT-words for time series classification
  • Bruzzone, L., et al. Automatic analysis of the difference image for unsupervised change detection. IEEE Trans. Geosci. Remote Sens. (2000)
  • Butz, A. Alternative algorithm for Hilbert's space-filling curve. IEEE Trans. Comput. (1971)
  • Chandra, S., et al. Deep spatio-temporal random fields for efficient video segmentation
  • Chattopadhyay, A., et al. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks
  • Chelali, M., et al. Urban land cover analysis from satellite image time series based on temporal stability
  • Chelali, M., et al. Image time series classification based on a planar spatio-temporal data representation
  • Coppin, P., et al. Digital change detection methods in ecosystem monitoring: A review. Int. J. Remote Sens. (2004)
  • Correa, Y.T.S., et al. A method for the analysis of small crop fields in Sentinel-2 dense time series. IEEE Trans. Geosci. Remote Sens. (2020)
  • Di Mauro, N., et al. End-to-end learning of deep spatio-temporal representations for satellite image time series classification
  • Falco, N., et al. Change detection in VHR images based on morphological attribute profiles. IEEE Geosci. Remote Sens. Lett. (2013)
  • Fawaz, H.I., et al. Deep learning for time series classification: a review. Data Min. Knowl. Discov. (2019)
  • Feichtenhofer, C., et al. Spatiotemporal multiplier networks for video action recognition
  • Goroshin, R., et al. Unsupervised feature learning from temporal data
  • Huang, B., et al. Large-scale semantic classification: Outcome of the first year of Inria aerial image labeling benchmark
¹ All authors contributed equally to the different steps of the elaboration of the paper.
