Abstract
In the age of deep learning, researchers have approached domain adaptation through the pre-training and fine-tuning paradigm to leverage the gains made in the natural image domain. However, the backbones and detection networks used in this paradigm are designed for object detection in natural images and do not consider some of the critical characteristics of document images: document images are sparse in contextual information, and graphical page objects are logically clustered. This paper investigates the effectiveness of deep and robust backbones in the document image domain. Further, it explores the idea of learnable object proposals through Sparse R-CNN. This paper shows that simple domain adaptation of top-performing object detectors to the document image domain does not lead to better results. Furthermore, it shows empirically that detectors based on dense object priors, such as Faster R-CNN, Mask R-CNN, and Cascade Mask R-CNN, are perhaps not best suited for graphical page object detection. Detectors that reduce the number of object candidates while making them learnable are a step towards a better approach. We formulate and evaluate the Sparse R-CNN (SR-CNN) model on the IIIT-AR-13k, PubLayNet, and DocBank datasets and hope to inspire a rethinking of object proposals in the domain of graphical page object detection.