Abstract
To address the inadequate representation and ineffective information exchange caused by feature homogenization in underwater acoustic target recognition, we introduce Mobile_ViT, a hybrid network that combines the MobileNet and Transformer architectures. The network begins with a convolutional backbone that embeds a coordinate attention mechanism to enhance the local details of the input. This mechanism captures the long-term temporal dependencies and precise frequency-domain relationships of the signals, focusing the features on informative time-frequency positions. A Transformer encoder is then attached to the end of the backbone to provide global characterization, compensating for the convolutional neural network's weakness in capturing long-range feature dependencies. On the ShipsEar and DeepShip datasets, the method achieves accuracies of 98.50% and 94.57%, respectively, a substantial improvement over the baseline. The proposed method also produces clearly separated feature clusters, as reflected in its separation coefficients, which indicates more effective clustering, and it is lighter than other Transformer-based models.
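As a rough illustration of how coordinate attention can focus features on time-frequency positions, the sketch below shows only the directional pooling and gating idea on a single-channel spectrogram. It is a simplified sketch, not the Mobile_ViT implementation: the function name `coordinate_gate` is hypothetical, and the learned convolutions that normally transform the pooled descriptors are replaced by the identity as a stated assumption.

```python
import math


def sigmoid(v):
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_gate(spec):
    """Re-weight a spectrogram indexed as spec[frequency][time].

    Hypothetical helper: pools along each axis separately, turns the
    pooled descriptors into gates, and scales every cell by the gate of
    its frequency bin and the gate of its time frame.
    """
    n_freq = len(spec)
    n_time = len(spec[0])

    # Pool along time: one descriptor per frequency bin.
    freq_desc = [sum(row) / n_time for row in spec]
    # Pool along frequency: one descriptor per time frame.
    time_desc = [
        sum(spec[f][t] for f in range(n_freq)) / n_freq
        for t in range(n_time)
    ]

    # Assumption: learned 1x1 convolutions are omitted (identity), so the
    # gates depend only on the pooled averages.
    freq_gate = [sigmoid(v) for v in freq_desc]
    time_gate = [sigmoid(v) for v in time_desc]

    return [
        [spec[f][t] * freq_gate[f] * time_gate[t] for t in range(n_time)]
        for f in range(n_freq)
    ]


# Toy input: a single energetic cell at frequency bin 1, time frame 1.
spec = [
    [0.0, 0.0, 0.0],
    [0.0, 4.0, 0.0],
    [0.0, 0.0, 0.0],
]
out = coordinate_gate(spec)
```

Because each gate is computed from a pooling along one axis only, the weight applied to a cell retains its position along both the frequency and the time axis, which is the property the abstract attributes to the attention mechanism.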