MTRNet++: One-stage mask-based scene text eraser

https://doi.org/10.1016/j.cviu.2020.103066

Highlights

  • The proposed MTRNet++ has a novel one-stage mask-based architecture.

  • MTRNet++ achieves state-of-the-art results on the Oxford and SCUT datasets.

  • MTRNet++ is end-to-end trainable. It converges on a large-scale dataset within an epoch.

  • MTRNet++ demonstrates controllability and interpretability.

  • We introduce incremental modifications to the training losses and training strategy.

Abstract

A precise, controllable, interpretable and easily trainable text removal approach is necessary for both user-specific and large-scale text removal applications. To this end, we propose a one-stage mask-based text inpainting network, MTRNet++. It has a novel architecture that includes mask-refine, coarse-inpainting and fine-inpainting branches, and attention blocks. With this architecture, MTRNet++ can remove text either with or without an external mask. It achieves state-of-the-art results on both the Oxford and SCUT datasets without using external ground-truth masks. Ablation studies demonstrate that the proposed multi-branch architecture with attention blocks is effective and essential. MTRNet++ also demonstrates controllability and interpretability.

Introduction

Text removal is the task of inpainting text regions in scenes with semantically correct backgrounds. It is useful for privacy protection, image/video editing, and image retrieval (Tursun et al., 2019a). Recent studies with advanced deep learning models (Zhang et al., 2018, Tursun et al., 2019b, Nakamura et al., 2017) explore removing text from real-world scenes. Previously, text removal with traditional methods was only effective for text in fixed positions on a uniform background in digital-born content.

Text removal is a challenging task as it inherits the challenges of both text detection and inpainting. Zhang et al. (2018) claim that text removal requires stroke-level text localisation, which is a harder and less studied topic than bounding-box-level scene text detection (Gupta et al., 2016). On the other hand, realistic inpainting requires replacing/filling unwanted objects in scenes with perceptually plausible content. Deep learning approaches generate perceptually plausible content by learning the distribution of the training data. Text can be placed in any region or on any object, so a text inpainting model must learn a wide range of distributions, which exacerbates the challenge.

Recent text removal studies follow two main paradigms: one-stage and two-stage methods. The one-stage approach uses an end-to-end model, which is free of auxiliary inputs; however, its generality, flexibility and interpretability are limited. EnsNet (Zhang et al., 2018) is the state-of-the-art one-stage text removal method. It removes all text written in scripts on which it was trained. Although it supports selective text removal with some extra steps (cropping or overlapping), these introduce new issues such as colour discontinuities or loss of context. Moreover, it lacks interpretability, which makes fixing failure cases troublesome. Finally, and importantly, we found that a one-stage approach without an explicit text localisation mechanism fails to converge on a large-scale dataset within a few epochs. For example, according to the results reported in MTRNet (Tursun et al., 2019b) and in this work, Pix2Pix (Isola et al., 2017) and EnsNet do not converge on a large-scale dataset given the same amount of training as their counterparts.

The two-stage approach decomposes the text removal task into text detection and text inpainting sub-problems. Text detection can be manual or automatic. MTRNet (Tursun et al., 2019b), for example, is a representative deep-neural-network inpainting stage of the two-stage approach. The main advantage of a two-stage approach is its awareness of the text regions that require inpainting. With explicit text regions, it gains generality and controllability. Two-stage approaches can be effortlessly adapted to remove text in various scripts, and are able to remove or keep text based on selection. Such methods also have strong interpretability; for example, it is easier to determine whether failure cases are caused by inaccurate detection or poor inpainting. Despite their promise, two-stage methods are limited in that they rely on a text localisation front-end. Compared to a one-stage approach, they are less efficient and have a more complex training process, as at least two networks must be trained.

In this work, we propose a one-stage text removal approach, MTRNet++. It can remove text either with a mask, as MTRNet does, or without a mask, as EnsNet does, as shown in Fig. 1. It inherits MTRNet's idea of using a text-region mask as an auxiliary input. However, the network has a different architecture to MTRNet, and is composed of mask-refine, coarse-inpainting, and fine-inpainting branches. The mask-refine and coarse-inpainting branches generate intermediate results, while the fine-inpainting branch generates the final refined results. Moreover, compared to MTRNet, it can remove text given only very coarse masks, as shown in Figs. 1c and 1d.
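For concreteness, the sketch below shows one plausible way the auxiliary mask input can be formed; this is not the authors' code, and the all-ones default for the mask-free case, the tensor shapes and the function name are assumptions.

```python
import torch

def build_generator_input(image, coarse_mask=None):
    """Concatenate an RGB image with a coarse text-region mask as a 4th channel.

    image:       (B, 3, H, W) tensor in [0, 1]
    coarse_mask: (B, 1, H, W) tensor, 1 marks regions that may contain text.
                 If None, an all-ones mask is assumed, i.e. the network must
                 localise and erase text on its own.
    """
    if coarse_mask is None:
        coarse_mask = torch.ones_like(image[:, :1])   # mask-free operation
    return torch.cat([image, coarse_mask], dim=1)      # (B, 4, H, W)

# x = build_generator_input(img)            # without a mask, as per EnsNet
# x = build_generator_input(img, box_mask)  # with a coarse mask, as per MTRNet
```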

The mask-refine branch is designed to refine a coarse mask into an accurate pixel-level mask. It is introduced for the following reasons: (1) to remove text, as EnsNet does, without requiring third-party text-localisation information, while still allowing the use of such information as done by MTRNet; (2) to provide interpretability, flexibility and a generalisation ability; (3) to provide attention scores to the coarse-inpainting branch; and (4) to speed up the convergence of the network on a large-scale dataset.

The coarse-inpainting branch runs in parallel with the mask-refine branch and performs a coarse inpainting using the coarse mask. The coarse-to-fine framework has been shown to be beneficial for realistic inpainting (Yu et al., 2018b, Ma et al., 2019). The coarse branch is guided by the attention scores generated by the attention blocks, which map intermediate features of the mask-refine branch to weights.
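As a rough sketch of such an attention block (the 1x1-convolution-plus-sigmoid form, channel sizes and names below are assumptions; the actual design is given in Section 3):

```python
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Maps intermediate mask-refine features to spatial weights that gate
    coarse-inpainting features at the same resolution (sketch)."""

    def __init__(self, mask_channels, inpaint_channels):
        super().__init__()
        self.to_weights = nn.Sequential(
            nn.Conv2d(mask_channels, inpaint_channels, kernel_size=1),
            nn.Sigmoid(),  # scores in (0, 1): high where text is likely
        )

    def forward(self, mask_feat, inpaint_feat):
        # re-weight the inpainting features with mask-derived attention
        return inpaint_feat * self.to_weights(mask_feat)
```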

The fine-inpainting branch refines the results of the coarse-inpainting branch using the precise masks produced by the mask-refine branch. The coarse results are blurry and lack detail; the fine-inpainting branch increases inpainting quality. In this work, for efficiency, a light sub-network is used as the fine-inpainting branch.
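Putting the three branches together, a minimal sketch of a plausible forward pass is shown below; the compositing step, module names and interfaces are assumptions made for illustration, and the exact branch designs are given in Section 3.

```python
import torch
import torch.nn as nn

class MultiBranchGenerator(nn.Module):
    """Sketch of a mask-refine / coarse-inpaint / fine-inpaint generator."""

    def __init__(self, mask_branch, coarse_branch, fine_branch):
        super().__init__()
        self.mask_branch = mask_branch      # -> (B, 1, H, W) pixel-level text mask
        self.coarse_branch = coarse_branch  # -> (B, 3, H, W) coarse inpainting
        self.fine_branch = fine_branch      # light refinement sub-network

    def forward(self, image, coarse_mask):
        x = torch.cat([image, coarse_mask], dim=1)   # 4-channel conditional input
        refined_mask = self.mask_branch(x)           # intermediate output
        coarse = self.coarse_branch(x)               # intermediate output
        # (attention gating between the two parallel branches omitted here)
        # keep non-text pixels from the input; inpaint only the detected text
        merged = coarse * refined_mask + image * (1 - refined_mask)
        fine = self.fine_branch(torch.cat([merged, refined_mask], dim=1))
        return refined_mask, coarse, fine
```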

In summary, our contributions are as follows:

  • We propose a novel one-stage architecture for text removal. With this architecture, MTRNet++ is free from external text localisation methods, yet can also leverage external information.

  • MTRNet++ achieves state-of-the-art quantitative and qualitative results on both the Oxford (Gupta et al., 2016) and SCUT (Zhang et al., 2018) datasets without external masks. Ablation studies show that the proposed architecture and its components play important roles.

  • MTRNet++ is fully end-to-end trainable and easy to train. It converges on large-scale datasets within an epoch. It also demonstrates controllability and interpretability.

We also introduce other incremental modifications regarding training losses, training strategy and the discriminator, which will be discussed in Section 3.

The rest of the paper is organised as follows. The next section presents related literature. Section 3 describes the proposed network architecture (generator and discriminator), training losses and training strategy. Section 4 presents experiments, ablation studies and analysis. Finally, we provide a brief summary of our work in Section 5.

Section snippets

Literature

Text removal is a special case of image inpainting that usually requires the assistance of a text-detection method for text localisation. Early text removal approaches (Khodadadi and Behrad, 2012, Wagh and Patil, 2015, Tursun et al., 2019a) are two-stage methods built on traditional text-detection and inpainting techniques. With the advance of deep learning, many classical problems including image inpainting are solved in a single stage using a deep encoder–decoder neural network (Mao …

MTRNet++

MTRNet++ is a text-inpainting network formulated as a conditional generative adversarial network (Isola et al., 2017). It is composed of a multi-branch generator G and a discriminator D. In the following sections, we describe their structures and the training strategy.
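The exact objectives are given later in this section; purely as a generic, hedged sketch of a conditional-adversarial training step for a mask-conditioned inpainting generator (the hinge-style losses, the L1 weight and all names below are assumptions, not the paper's objective):

```python
import torch.nn.functional as F

def adversarial_step(G, D, opt_g, opt_d, image, coarse_mask, target):
    """One generator/discriminator update for a conditional inpainting GAN (sketch)."""
    refined_mask, coarse, fake = G(image, coarse_mask)

    # discriminator update (hinge loss, assumed)
    opt_d.zero_grad()
    d_loss = F.relu(1.0 - D(target)).mean() + F.relu(1.0 + D(fake.detach())).mean()
    d_loss.backward()
    opt_d.step()

    # generator update: adversarial term plus L1 reconstruction (weight assumed)
    opt_g.zero_grad()
    g_loss = -D(fake).mean() + 10.0 * F.l1_loss(fake, target)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```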

Datasets and evaluation metrics

In this study, MTRNet++ is primarily compared with the previous state-of-the-art methods, MTRNet and EnsNet. For a fair comparison, the same datasets and evaluation metrics introduced by MTRNet and EnsNet are applied. The comparison includes quantitative and qualitative results. Quantitative results are given for synthetic datasets that have ground truth, while qualitative visualisations are provided for both synthetic and real datasets. Most of the datasets used for training and evaluation are …
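The snippet above is truncated, but it refers to quantitative evaluation against ground truth on synthetic data. As one example of a metric commonly used for such image-to-image comparisons, a PSNR computation is sketched below; whether PSNR is among the exact metrics applied here is detailed in Section 4, and the function is illustrative only.

```python
import torch

def psnr(output, target, max_val=1.0):
    """Peak signal-to-noise ratio between an inpainted image and ground truth.

    output, target: (B, C, H, W) tensors in [0, max_val]; higher is better.
    """
    mse = torch.mean((output - target) ** 2, dim=(1, 2, 3))
    return 10.0 * torch.log10(max_val ** 2 / mse)
```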

Conclusion

In this work, a one-stage mask-based conditional generative adversarial network, MTRNet++, is proposed for real-scene text removal. It is self-complete, controllable and interpretable. It shows state-of-the-art quantitative results on both the Oxford and SCUT test datasets without user-provided text masks. Visual results also show that MTRNet++ generates realistic inpainting for text regions. Related ablation studies show that the proposed multi-branch generator is essential for state-of-the-art …

CRediT authorship contribution statement

Osman Tursun: Conceptualization, Methodology, Software, Writing - original draft, Investigation, Validation. Simon Denman: Supervision, Writing - review & editing, Conceptualization, Methodology. Rui Zeng: Conceptualization, Methodology. Sabesan Sivapalan: Conceptualization, Methodology, Supervision. Sridha Sridharan: Writing - review & editing, Supervision, Project design. Clinton Fookes: Supervision, Project design, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This research was supported by an Advance Queensland Research Fellowship Award, Australia.

References (33)

  • Wolf, C., et al., 2014. Evaluation of video activity localizations integrating quality and quantity measurements. Comput. Vis. Image Underst. (CVIU).
  • Baek, Y., Lee, B., Han, D., Yun, S., Lee, H., 2019. Character region awareness for text detection. In: Proceedings of...
  • Eigen, D., Krishnan, D., Fergus, R., 2013. Restoring an image taken through a window covered with dirt or rain. In:...
  • Gupta, A., Vedaldi, A., Zisserman, A., 2016. Synthetic data for text localisation in natural images. In: Proc. of...
  • He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE...
  • Iizuka, S., et al., 2017. Globally and locally consistent image completion. ToG.
  • Isola, P., et al., 2017. Image-to-image translation with conditional adversarial networks. In: CVPR.
  • Jo, Y., et al., 2019. SC-FEGAN: Face editing generative adversarial network with user's sketch and color.
  • Johnson, J., et al. Perceptual losses for real-time style transfer and super-resolution.
  • Karatzas, D., et al. ICDAR 2013 robust reading competition.
  • Khodadadi, M., et al. Text localization, extraction and inpainting in color images.
  • Liu, G., Reda, F.A., Shih, K.J., Wang, T.-C., Tao, A., Catanzaro, B., 2018. Image inpainting for irregular holes using...
  • Ma, Y., et al. Coarse-to-fine image inpainting via region-wise convolutions and non-local correlation.
  • Maas, A.L., Hannun, A.Y., Ng, A.Y., 2013. Rectifier nonlinearities improve neural network acoustic models. In: Proc....
  • Mao, X.-J., et al., 2016. Image restoration using convolutional auto-encoders with symmetric skip connections.
  • Miyato, T., et al., 2018. Spectral normalization for generative adversarial networks.