MTRNet++: One-stage mask-based scene text eraser

https://doi.org/10.1016/j.cviu.2020.103066

Highlights

  • The proposed MTRNet++ has a novel one-stage mask-based architecture.

  • MTRNet++ achieves state-of-the-art results on the Oxford and SCUT datasets.

  • MTRNet++ is end-to-end trainable. It converges on a large-scale dataset within an epoch.

  • MTRNet++ demonstrates controllability and interpretability.

  • We introduce incremental modifications to the training losses and training strategy.

Abstract

A precise, controllable, interpretable and easily trainable text removal approach is necessary for both user-specific and large-scale text removal applications. To this end, we propose a one-stage mask-based text inpainting network, MTRNet++. It has a novel architecture that includes mask-refine, coarse-inpainting and fine-inpainting branches, and attention blocks. With this architecture, MTRNet++ can remove text either with or without an external mask. It achieves state-of-the-art results on both the Oxford and SCUT datasets without using external ground-truth masks. Ablation studies demonstrate that the proposed multi-branch architecture with attention blocks is effective and essential. MTRNet++ also demonstrates controllability and interpretability.

Introduction

Text removal is the task of inpainting text regions in scenes with semantically correct backgrounds. It is useful for privacy protection, image/video editing, and image retrieval (Tursun et al., 2019a). Recent studies with advanced deep learning models (Zhang et al., 2018, Tursun et al., 2019b, Nakamura et al., 2017) explore removing text from real-world scenes. Previously, text removal with traditional methods was only effective for text in fixed positions on a uniform background in digital-born content.

Text removal is a challenging task as it inherits the challenges of both text detection and inpainting. Zhang et al. (2018) claim that text removal requires stroke-level text localisation, which is a harder and less studied topic than bounding-box-level scene text detection (Gupta et al., 2016). On the other hand, realistic inpainting requires replacing/filling unwanted objects in scenes with perceptually plausible content. Deep learning approaches generate perceptually plausible content by learning the distribution of the training data. Text can be placed in any region or on any object, so a text inpainting model must learn a wide range of distributions, which exacerbates the challenge.

Recent text removal studies follow two main paradigms: one-stage and two-stage methods. The one-stage approach uses an end-to-end model, which is free of auxiliary inputs; however, its generality, flexibility and interpretability are limited. EnsNet (Zhang et al., 2018) is the state-of-the-art one-stage text removal method. It removes all text written in scripts on which it was trained. Although it supports selective text removal with some extra steps (cropping or overlapping), these introduce new issues such as colour discontinuities or loss of context. Moreover, it lacks interpretability, which makes fixing failure cases troublesome. Finally, and importantly, we found that a one-stage approach without an explicit text localisation mechanism fails to converge on a large-scale dataset within a few epochs. For example, according to the results reported in MTRNet (Tursun et al., 2019b) and in this work, Pix2Pix (Isola et al., 2017) and EnsNet do not converge on a large-scale dataset given the same amount of training as their counterparts.

The two-stage approach decomposes the text removal task into text detection and text inpainting sub-problems. Text detection can be manual or automatic. MTRNet (Tursun et al., 2019b), for example, is a representative deep-neural-network inpainting stage of the two-stage approach. The main advantage of a two-stage approach is its awareness of the text regions that require inpainting. With explicit text regions, it gains generality and controllability. Two-stage approaches can be effortlessly adapted to remove text in various scripts, and are able to remove or keep text based on selection. Such methods also have strong interpretability; for example, it is easier to determine whether failure cases are caused by inaccurate detection or poor inpainting. Despite their promise, two-stage methods are limited in that they rely on a text localisation front-end. Compared to a one-stage approach, they are less efficient and have a more complex training process, as at least two networks must be trained.

In this work, we propose a one-stage text removal approach, MTRNet++. It can remove text either with a mask, as MTRNet does, or without a mask, as EnsNet does, as shown in Fig. 1. It inherits MTRNet's idea of using a text-region mask as an auxiliary input. However, the network has a different architecture to MTRNet, and is composed of mask-refine, coarse-inpainting, and fine-inpainting branches. The mask-refine and coarse-inpainting branches generate intermediate results, while the fine-inpainting branch generates the final refined results. Moreover, compared to MTRNet, it can remove text given only very coarse masks, as shown in Figs. 1c and 1d.
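For concreteness, the sketch below shows one plausible way the auxiliary mask input can be formed; this is not the authors' code, and the all-ones default for the mask-free case, the tensor shapes and the function name are assumptions.

```python
import torch

def build_generator_input(image, coarse_mask=None):
    """Concatenate an RGB image with a coarse text-region mask as a 4th channel.

    image:       (B, 3, H, W) tensor in [0, 1]
    coarse_mask: (B, 1, H, W) tensor, 1 marks regions that may contain text.
                 If None, an all-ones mask is assumed, i.e. the network must
                 localise and erase text on its own.
    """
    if coarse_mask is None:
        coarse_mask = torch.ones_like(image[:, :1])   # mask-free operation
    return torch.cat([image, coarse_mask], dim=1)      # (B, 4, H, W)

# x = build_generator_input(img)            # without a mask, as per EnsNet
# x = build_generator_input(img, box_mask)  # with a coarse mask, as per MTRNet
```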

The mask-refine branch is designed to refine a coarse mask into an accurate pixel-level mask. It is introduced for the following reasons: (1) to remove text, as EnsNet does, without requiring third-party text-localisation information, while still allowing the use of such information as done by MTRNet; (2) to provide interpretability, flexibility and a generalisation ability; (3) to provide attention scores to the coarse-inpainting branch; and (4) to speed up the convergence of the network on a large-scale dataset.

The coarse-inpainting branch runs in parallel with the mask-refine branch and performs a coarse inpainting using the coarse mask. The coarse-to-fine framework has been shown to be beneficial for realistic inpainting (Yu et al., 2018b, Ma et al., 2019). The coarse branch is guided by the attention scores generated by the attention blocks, which map intermediate features of the mask-refine branch to weights.
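As a rough sketch of such an attention block (the 1x1-convolution-plus-sigmoid form, channel sizes and names below are assumptions; the actual design is given in Section 3):

```python
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Maps intermediate mask-refine features to spatial weights that gate
    coarse-inpainting features at the same resolution (sketch)."""

    def __init__(self, mask_channels, inpaint_channels):
        super().__init__()
        self.to_weights = nn.Sequential(
            nn.Conv2d(mask_channels, inpaint_channels, kernel_size=1),
            nn.Sigmoid(),  # scores in (0, 1): high where text is likely
        )

    def forward(self, mask_feat, inpaint_feat):
        # re-weight the inpainting features with mask-derived attention
        return inpaint_feat * self.to_weights(mask_feat)
```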

The fine-inpainting branch refines the results of the coarse-inpainting branch using the precise masks produced by the mask-refine branch. The coarse results are blurry and lack detail; the fine-inpainting branch increases inpainting quality. In this work, for efficiency, a light sub-network is used as the fine-inpainting branch.
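Putting the three branches together, a minimal sketch of a plausible forward pass is shown below; the compositing step, module names and interfaces are assumptions made for illustration, and the exact branch designs are given in Section 3.

```python
import torch
import torch.nn as nn

class MultiBranchGenerator(nn.Module):
    """Sketch of a mask-refine / coarse-inpaint / fine-inpaint generator."""

    def __init__(self, mask_branch, coarse_branch, fine_branch):
        super().__init__()
        self.mask_branch = mask_branch      # -> (B, 1, H, W) pixel-level text mask
        self.coarse_branch = coarse_branch  # -> (B, 3, H, W) coarse inpainting
        self.fine_branch = fine_branch      # light refinement sub-network

    def forward(self, image, coarse_mask):
        x = torch.cat([image, coarse_mask], dim=1)   # 4-channel conditional input
        refined_mask = self.mask_branch(x)           # intermediate output
        coarse = self.coarse_branch(x)               # intermediate output
        # (attention gating between the two parallel branches omitted here)
        # keep non-text pixels from the input; inpaint only the detected text
        merged = coarse * refined_mask + image * (1 - refined_mask)
        fine = self.fine_branch(torch.cat([merged, refined_mask], dim=1))
        return refined_mask, coarse, fine
```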

In summary, our contributions are as follows:

  • We propose a novel one-stage architecture for text removal. With this architecture, MTRNet++ is free from external text localisation methods, yet can also leverage external information.

  • MTRNet++ achieves state-of-the-art quantitative and qualitative results on both the Oxford (Gupta et al., 2016) and SCUT (Zhang et al., 2018) datasets without external masks. Ablation studies show that the proposed architecture and its components play important roles.

  • MTRNet++ is fully end-to-end trainable and easy to train. It converges on large-scale datasets within an epoch. It also demonstrates controllability and interpretability.

We also introduce other incremental modifications regarding training losses, training strategy and the discriminator, which will be discussed in Section 3.

The rest of the paper is organised as follows. The next section presents related literature. Section 3 describes the proposed network architecture (generator and discriminator), training losses and training strategy. Section 4 presents experiments, ablation studies and analysis. Finally, we provide a brief summary of our work in Section 5.

Section snippets

Literature

Text removal is a special case of image inpainting that usually requires the assistance of a text-detection method for text localisation. Early text removal approaches (Khodadadi and Behrad, 2012, Wagh and Patil, 2015, Tursun et al., 2019a) are two-stage methods built on traditional text-detection and inpainting techniques. With the advance of deep learning, many classical problems including image inpainting are solved in a single stage using a deep encoder–decoder neural network (Mao …

MTRNet++

MTRNet++ is a text-inpainting network formulated as a conditional generative adversarial network (Isola et al., 2017). It is composed of a multi-branch generator G and a discriminator D. In the following sections, we describe their structures and the training strategy.
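The exact objectives are given later in this section; purely as a generic, hedged sketch of a conditional-adversarial training step for a mask-conditioned inpainting generator (the hinge-style losses, the L1 weight and all names below are assumptions, not the paper's objective):

```python
import torch.nn.functional as F

def adversarial_step(G, D, opt_g, opt_d, image, coarse_mask, target):
    """One generator/discriminator update for a conditional inpainting GAN (sketch)."""
    refined_mask, coarse, fake = G(image, coarse_mask)

    # discriminator update (hinge loss, assumed)
    opt_d.zero_grad()
    d_loss = F.relu(1.0 - D(target)).mean() + F.relu(1.0 + D(fake.detach())).mean()
    d_loss.backward()
    opt_d.step()

    # generator update: adversarial term plus L1 reconstruction (weight assumed)
    opt_g.zero_grad()
    g_loss = -D(fake).mean() + 10.0 * F.l1_loss(fake, target)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```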

Datasets and evaluation metrics

In this study, MTRNet++ is primarily compared with the previous state-of-the-art methods, MTRNet and EnsNet. For a fair comparison, the same datasets and evaluation metrics introduced by MTRNet and EnsNet are applied. The comparison includes quantitative and qualitative results. Quantitative results are given for synthetic datasets that have ground truth, while qualitative visualisations are provided for both synthetic and real datasets. Most of the datasets used for training and evaluation are …
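The snippet above is truncated, but it refers to quantitative evaluation against ground truth on synthetic data. As one example of a metric commonly used for such image-to-image comparisons, a PSNR computation is sketched below; whether PSNR is among the exact metrics applied here is detailed in Section 4, and the function is illustrative only.

```python
import torch

def psnr(output, target, max_val=1.0):
    """Peak signal-to-noise ratio between an inpainted image and ground truth.

    output, target: (B, C, H, W) tensors in [0, max_val]; higher is better.
    """
    mse = torch.mean((output - target) ** 2, dim=(1, 2, 3))
    return 10.0 * torch.log10(max_val ** 2 / mse)
```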

Conclusion

In this work, a one-stage mask-based conditional generative adversarial network, MTRNet++, is proposed for real-scene text removal. It is self-complete, controllable and interpretable. It shows state-of-the-art quantitative results on both the Oxford and SCUT test datasets without user-provided text masks. Visual results also show that MTRNet++ generates realistic inpainting for text regions. Related ablation studies show that the proposed multi-branch generator is essential for state-of-the-art …

CRediT authorship contribution statement

Osman Tursun: Conceptualization, Methodology, Software, Writing - original draft, Investigation, Validation. Simon Denman: Supervision, Writing - review & editing, Conceptualization, Methodology. Rui Zeng: Conceptualization, Methodology. Sabesan Sivapalan: Conceptualization, Methodology, Supervision. Sridha Sridharan: Writing - review & editing, Supervision, Project design. Clinton Fookes: Supervision, Project design, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This research was supported by an Advance Queensland Research Fellowship Award, Australia.

References (33)

  • Wolf, C., et al., 2014. Evaluation of video activity localizations integrating quality and quantity measurements. Comput. Vis. Image Underst. (CVIU).
  • Baek, Y., Lee, B., Han, D., Yun, S., Lee, H., 2019. Character region awareness for text detection. In: Proceedings of...
  • Eigen, D., Krishnan, D., Fergus, R., 2013. Restoring an image taken through a window covered with dirt or rain. In:...
  • Gupta, A., Vedaldi, A., Zisserman, A., 2016. Synthetic data for text localisation in natural images. In: Proc. of...
  • He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE...
  • Iizuka, S., et al., 2017. Globally and locally consistent image completion. ToG.
  • Isola, P., et al., 2017. Image-to-image translation with conditional adversarial networks. In: CVPR.
  • Jo, Y., et al., 2019. SC-FEGAN: Face editing generative adversarial network with user's sketch and color.
  • Johnson, J., et al. Perceptual losses for real-time style transfer and super-resolution.
  • Karatzas, D., et al. ICDAR 2013 robust reading competition.
  • Khodadadi, M., et al. Text localization, extraction and inpainting in color images.
  • Liu, G., Reda, F.A., Shih, K.J., Wang, T.-C., Tao, A., Catanzaro, B., 2018. Image inpainting for irregular holes using...
  • Ma, Y., et al. Coarse-to-fine image inpainting via region-wise convolutions and non-local correlation.
  • Maas, A.L., Hannun, A.Y., Ng, A.Y., 2013. Rectifier nonlinearities improve neural network acoustic models. In: Proc....
  • Mao, X.-J., et al., 2016. Image restoration using convolutional auto-encoders with symmetric skip connections.
  • Miyato, T., et al., 2018. Spectral normalization for generative adversarial networks.