End-to-end text spotting methods built on the Transformer architecture have increasingly demonstrated superior performance. These methods use a bipartite graph matching algorithm to establish a one-to-one optimal match between predicted objects and ground-truth objects. However, the instability of bipartite graph matching can lead to inconsistent optimization targets, degrading the model's training performance. Existing work applies denoising training to mitigate the instability of bipartite graph matching in object detection. Unfortunately, this denoising training method cannot be directly applied to text spotting, which must detect irregularly shaped text and perform text recognition, a task more complex than classification. To address this issue, we propose a novel denoising training method (DNTextSpotter) for arbitrary-shaped text spotting. Specifically, we decompose the queries of the denoising part into noised positional queries and noised content queries. We generate the noised positional queries from the four Bezier control points of the Bezier center curve. For the noised content queries, since outputting text in a fixed positional order hinders the alignment of position with content, we employ a masked character sliding method to initialize them, thereby assisting in the alignment of text content and position.
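As a rough illustration of the noised positional queries described above, the sketch below perturbs the four Bezier control points of each instance's center curve. This is a hypothetical sketch, not the paper's exact recipe: the function name, the uniform noise distribution, and the `noise_scale` value are all assumptions; it only assumes control points normalized to [0, 1].

```python
import numpy as np

def noised_positional_queries(bezier_ctrl_pts, noise_scale=0.05, rng=None):
    """Hypothetical sketch: build noised positional queries by perturbing
    the 4 Bezier control points of each instance's center curve.

    bezier_ctrl_pts: (N, 4, 2) array of normalized (x, y) control points.
    noise_scale: magnitude of the uniform noise (an assumed hyperparameter).
    """
    rng = np.random.default_rng() if rng is None else rng
    pts = np.asarray(bezier_ctrl_pts, dtype=np.float64)
    # Shift every control point by uniform noise, then clamp the result
    # back into the normalized image plane.
    noise = rng.uniform(-noise_scale, noise_scale, size=pts.shape)
    return np.clip(pts + noise, 0.0, 1.0)
```

During denoising training, each noised query is derived from a known ground-truth instance, so the model can be supervised to recover the clean control points from the perturbed ones.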
DNTextSpotter: the overall framework. The model uses a backbone and an encoder to extract multi-scale features. The decoder's queries are divided into two parts: a matching part and a denoising part. The queries in the matching part are randomly initialized.
The noised queries of the denoising part are described in Noised Positional Queries and Noised Content Queries. After the decoder and the task-specific heads, the matching part computes its loss through a bipartite graph matching algorithm, while the denoising part computes its loss directly against the ground truth.
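The distinction between the two parts can be sketched with a toy example: the matching part needs a Hungarian (bipartite) assignment between predictions and ground truth, whereas each denoising query already knows which ground-truth instance it was built from. The L1 cost and the point coordinates below are illustrative assumptions, not the model's actual matching cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_costs(pred, gt):
    """Toy pairwise L1 cost between predicted and ground-truth points."""
    return np.abs(pred[:, None, :] - gt[None, :, :]).sum(-1)

pred = np.array([[0.9, 0.9], [0.1, 0.2], [0.5, 0.5]])
gt   = np.array([[0.1, 0.1], [0.5, 0.5], [0.9, 0.9]])

# Matching part: one-to-one optimal assignment over the cost matrix.
rows, cols = linear_sum_assignment(match_costs(pred, gt))
# → prediction i is supervised by ground-truth cols[i].

# Denoising part: each noised query was generated from a specific
# ground-truth instance, so its target index is fixed in advance and
# no assignment step is needed.
```

Because the denoising targets are fixed, their supervision signal is stable across iterations, which is what makes denoising training useful against matching instability.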
Performance on Total-Text and CTW1500 with different backbones. E2E denotes the end-to-end spotting results. "None" denotes lexicon-free. "Full" denotes the inclusion of all words present in the test dataset. The top three scores are shown in bold red, blue, and green fonts. Results without TextOCR in pre-training are marked with "*".
Qualitative examples: the rows show visualizations from ESTextSpotter, DeepSolo, and DNTextSpotter (Ours), respectively. In the recognition results, blue within parentheses indicates correct recognition and red indicates incorrect recognition; outside the parentheses, Ø indicates that no detection or recognition occurred. Additional visual analysis is provided in the paper.
We would like to thank Ziqiang Cao and Zili Wang for providing us with 4 days of GPU usage, which allowed us to run and find the best set of parameters.
@inproceedings{qiao2024dntextspotter,
title={DNTextSpotter: Arbitrary-Shaped Scene Text Spotting via Improved Denoising Training},
author={Qiao, Qian and Xie, Yu and Gao, Jun and Wu, Tianxiang and Huang, Shaoyao and Fan, Jiaqing and Cao, Ziqiang and Wang, Zili and Zhang, Yue},
booktitle={ACM Multimedia 2024},
year={2024}
}