Applied Sciences, Vol. 12, Issue 13 (2022)
ARTICLE

Dual-Modal Transformer with Enhanced Inter- and Intra-Modality Interactions for Image Captioning

Deepika Kumar, Varun Srivastava, Daniela Elena Popescu and Jude D. Hemanth

Abstract

Image captioning aims to describe an image in words that convey a semantically meaningful account of the depicted scene. Different models can be used to accomplish this arduous task depending on the context and on what needs to be achieved. An encoder–decoder model that takes image feature vectors as input to the encoder is often regarded as one of the appropriate models for the captioning process. In the proposed work, a dual-modal transformer has been used which captures intra- and inter-modality interactions simultaneously within an attention block. The transformer architecture is quantitatively evaluated on the publicly available Microsoft Common Objects in Context (MS COCO) dataset, yielding a Bilingual Evaluation Understudy (BLEU)-4 score of 85.01. The efficacy of the model is evaluated on the Flickr 8k, Flickr 30k, and MS COCO datasets, and the results are compared and analysed against state-of-the-art methods. The results show that the proposed model outperforms conventional models, such as the encoder–decoder model and the attention model.
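
To make the attention design concrete, the following is a minimal PyTorch sketch of an attention block that computes intra-modality self-attention and inter-modality cross-attention in parallel and fuses them with residual connections. The class name, the additive fusion, and the feature dimensions are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn

class DualModalAttentionBlock(nn.Module):
    # Sketch of a dual-modal attention block: intra-modality self-attention
    # and inter-modality cross-attention computed side by side, then fused.
    # Hypothetical design for illustration, not the paper's implementation.

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Self-attention within each modality (intra-modality interactions)
        self.self_attn_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention across modalities (inter-modality interactions)
        self.cross_attn_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(d_model)
        self.norm_txt = nn.LayerNorm(d_model)

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        # img: (batch, regions, d_model) image-region feature vectors
        # txt: (batch, tokens,  d_model) caption word embeddings
        intra_img, _ = self.self_attn_img(img, img, img)   # image attends to image
        intra_txt, _ = self.self_attn_txt(txt, txt, txt)   # text attends to text
        inter_img, _ = self.cross_attn_img(img, txt, txt)  # image queries text
        inter_txt, _ = self.cross_attn_txt(txt, img, img)  # text queries image
        # Fuse intra- and inter-modality signals with residual connections
        img_out = self.norm_img(img + intra_img + inter_img)
        txt_out = self.norm_txt(txt + intra_txt + inter_txt)
        return img_out, txt_out

if __name__ == "__main__":
    block = DualModalAttentionBlock()
    img = torch.randn(2, 36, 512)   # e.g. 36 detected regions per image
    txt = torch.randn(2, 20, 512)   # e.g. 20 caption tokens
    img_out, txt_out = block(img, txt)
    print(img_out.shape, txt_out.shape)  # both (2, seq_len, 512)

Because the self- and cross-attention branches read the same inputs, both kinds of interaction are captured within a single block rather than in alternating layers, which is the simultaneity the abstract describes.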