Resumen
The focus of fine-grained image classification tasks is to ignore interference information and grasp local features. This challenge is what the visual attention mechanism excels at. Firstly, we have constructed a two-level attention convolutional network, which characterizes the object-level attention and the pixel-level attention. Then, we combine the two kinds of attention through a second-order response transform algorithm. Furthermore, we propose a clustering-based grouping attention model, which implies the part-level attention. The grouping attention method is to stretch all the semantic features, in a deeper convolution layer of the network, into vectors. These vectors are clustered by a vector dot product, and each category represents a special semantic. The grouping attention algorithm implements the functions of group convolution and feature clustering, which can greatly reduce the network parameters and improve the recognition rate and interpretability of the network. Finally, the low-level visual features and high-level semantic information are merged by a multi-level feature fusion method to accurately classify fine-grained images. We have achieved good results without using pre-training networks and fine-tuning techniques.