DNTextSpotter: Arbitrary-Shaped Scene Text Spotting via Improved Denoising Training

1Soochow University, 2bilibili Inc., 3Inf Tech.
*Equal contribution, †Corresponding author.

Abstract

A growing number of end-to-end text spotting methods based on the Transformer architecture have demonstrated superior performance. These methods use a bipartite graph matching algorithm to perform one-to-one optimal matching between predicted objects and ground-truth objects. However, the instability of bipartite graph matching can lead to inconsistent optimization targets, thereby affecting the training performance of the model. Existing literature applies denoising training to address the instability of bipartite graph matching in object detection tasks. Unfortunately, this denoising training method cannot be directly applied to text spotting, which must detect irregularly shaped regions and perform text recognition, a task more complex than classification. To address this issue, we propose a novel denoising training method (DNTextSpotter) for arbitrary-shaped text spotting. Specifically, we decompose the queries of the denoising part into noised positional queries and noised content queries. We use the four Bézier control points of the Bézier center curve to generate the noised positional queries. For the noised content queries, considering that outputting text in a fixed positional order is not conducive to aligning position with content, we employ a masked character sliding method to initialize the noised content queries, thereby assisting the alignment of text content and position.

Network Architecture


DNTextSpotter: The overall framework of DNTextSpotter. The model uses a backbone and an encoder to extract multi-scale features. The queries of the decoder are divided into two parts: a matching part and a denoising part. The queries in the matching part are randomly initialized, while the noised queries of the denoising part are constructed as described in Noised Positional Queries and Noised Content Queries below. After the decoder and the task-specific heads, the matching part computes its loss through a bipartite graph matching algorithm, whereas the denoising part computes its loss directly against the ground truth.
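To make the query layout concrete, here is a minimal PyTorch sketch assuming a DETR-style decoder; all names are hypothetical (not from the released code), and how positional and content queries are combined varies by architecture, so they are summed here purely for illustration. The attention mask keeps ground-truth information in the denoising part from leaking into the matching part.

import torch

def build_decoder_queries(matching_queries, noised_pos_queries, noised_content_queries):
    """Assemble decoder queries from a matching part and a denoising part.

    matching_queries:       (num_match, dim) randomly initialized queries
    noised_pos_queries:     (num_dn, dim) built from noised Bezier control points
    noised_content_queries: (num_dn, dim) built from masked/flipped GT texts
    """
    denoising = noised_pos_queries + noised_content_queries
    queries = torch.cat([denoising, matching_queries], dim=0)

    # True = attention is blocked. The matching part must not attend to the
    # denoising part, since the latter is derived from ground truth.
    num_dn, total = denoising.size(0), queries.size(0)
    attn_mask = torch.zeros(total, total, dtype=torch.bool)
    attn_mask[num_dn:, :num_dn] = True
    return queries, attn_mask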

Noised Positional Queries & Noised Content Queries


Noised Content Queries: This image shows the process of generating noised content queries from the ground-truth texts. "Ø" indicates that a character will be masked, while "σ" denotes flipping a character to an arbitrary character.
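A simplified sketch of the character-level noising in the figure: each ground-truth character is masked (the Ø operation), flipped to a random character (the σ operation), or kept. The probabilities and charset are hypothetical hyperparameters, and this is only an illustration, not the paper's full masked character sliding procedure.

import random
import string

def noise_text(gt_text, mask_prob=0.2, flip_prob=0.2,
               charset=string.ascii_letters + string.digits):
    """Mask (Ø) or flip (σ) each character of a ground-truth transcription."""
    MASK = "\u00d8"  # 'Ø', stands in for a masked-character token
    noised = []
    for ch in gt_text:
        r = random.random()
        if r < mask_prob:
            noised.append(MASK)                    # the Ø operation
        elif r < mask_prob + flip_prob:
            noised.append(random.choice(charset))  # the σ operation
        else:
            noised.append(ch)
    return "".join(noised)

print(noise_text("STREET"))  # e.g. "SØRxET"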


Noised Positional Queries: We generate noised positional queries from the four Bézier control points of the ground truth: points are uniformly sampled along the Bézier curve, passed through a positional embedding, and then through a two-layer MLP.
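A minimal PyTorch sketch of that pipeline; the dimensions, noise scale, and module names are our assumptions rather than the paper's values. It jitters the four control points, uniformly samples the cubic Bézier curve in its parameter t, applies a sinusoidal positional embedding, and projects each sampled point with a two-layer MLP.

import math
import torch
import torch.nn as nn

def add_point_noise(ctrl_pts, scale=0.02):
    """Jitter the (4, 2) Bezier control points; the scale is hypothetical."""
    return ctrl_pts + scale * torch.randn_like(ctrl_pts)

def sample_cubic_bezier(ctrl_pts, num_samples=25):
    """Uniformly sample num_samples points (in t) on a cubic Bezier curve."""
    t = torch.linspace(0, 1, num_samples).unsqueeze(-1)   # (N, 1)
    p0, p1, p2, p3 = ctrl_pts                             # each (2,)
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)    # (N, 2)

def sine_embed(points, dim=128, temperature=10000.0):
    """Sinusoidal embedding of 2-D points -> (N, 2 * dim)."""
    freq = temperature ** (torch.arange(dim // 2) / (dim // 2))
    x = points[..., 0:1] * 2 * math.pi / freq             # (N, dim // 2)
    y = points[..., 1:2] * 2 * math.pi / freq
    return torch.cat([x.sin(), x.cos(), y.sin(), y.cos()], dim=-1)

class PosQueryMLP(nn.Module):
    """Two-layer MLP mapping sampled-point embeddings to positional queries."""
    def __init__(self, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, hidden_dim),
                                 nn.ReLU(),
                                 nn.Linear(hidden_dim, embed_dim))

    def forward(self, x):  # x: (N, embed_dim), one query per sampled point
        return self.net(x)

ctrl = torch.rand(4, 2)
queries = PosQueryMLP()(sine_embed(sample_cubic_bezier(add_point_noise(ctrl))))
print(queries.shape)  # torch.Size([25, 256])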

Convergence and Instability Results


The convergence curves of DNTextSpotter (Ours), DeepSolo, TESTR, and ESTextSpotter on the Inverse-Text dataset with the ResNet-50 backbone, reported as 'None' results, where 'None' denotes the F1-measure without lexicon.


For the IS (the instability of bipartite matching across training) of TESTR, DeepSolo, and DNTextSpotter, we trained for 120k steps under the same settings and computed the IS over every consecutive 10k-step interval.
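For reference, a sketch of how such an instability score can be computed; this is our simplified reading, and the paper may define IS differently. For each ground-truth instance, record the index of the query it is matched to at two consecutive evaluation points and count how often that assignment changes.

def instability_score(prev_match, curr_match):
    """Fraction of ground-truth instances whose matched query changed.

    prev_match, curr_match: dicts mapping GT instance id -> matched query
    index at two consecutive checkpoints (e.g. 10k steps apart).
    """
    common = prev_match.keys() & curr_match.keys()
    changed = sum(prev_match[g] != curr_match[g] for g in common)
    return changed / max(len(common), 1)

prev = {0: 3, 1: 7, 2: 5}
curr = {0: 3, 1: 2, 2: 5}
print(instability_score(prev, curr))  # 0.333..., one of three assignments changed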

Experimental Results

realcamnet_quantitative_results

Performance on Total-Text and CTW1500 with different backbones. "E2E" denotes the end-to-end spotting results. "None" denotes lexicon-free evaluation. "Full" denotes that all words present in the test set are used as the lexicon. The top three scores are shown in bold red, blue, and green fonts. Additionally, results without TextOCR in pre-training are marked with "*".

Visual Comparison Results


Several example instances: the rows show the visualizations of ESTextSpotter, DeepSolo, and DNTextSpotter (Ours), respectively. In the recognition results, blue within parentheses denotes correct recognition, while red denotes incorrect recognition; outside the parentheses, Ø signifies that no detection or recognition took place. Additional visual analysis is provided in the paper.

Acknowledgements

We would like to thank Ziqiang Cao and Zili Wang for providing us with 4 days of GPU usage, which allowed us to run our experiments and find the best set of parameters.

BibTeX

@inproceedings{qiao2024dntextspotter,
  title={DNTextSpotter: Arbitrary-Shaped Scene Text Spotting via Improved Denoising Training},
  author={Qiao, Qian and Xie, Yu and Gao, Jun and Wu, Tianxiang and Huang, Shaoyao and Fan, Jiaqing and Cao, Ziqiang and Wang, Zili and Zhang, Yue},
  booktitle={ACM Multimedia 2024},
  year={2024}
}