Abstract
The complex backgrounds of crop disease images and the low contrast between the disease area and the background easily cause confusion, which seriously affects the robustness and accuracy of apple disease-identification models. To address these problems, this paper proposes ConvViT, a lightweight Vision Transformer-based apple leaf disease-identification model that extracts effective features of disease spots in order to identify crop diseases. ConvViT combines convolutional and Transformer structures: the convolutional structure is used to extract the global features of the image, and the Transformer structure is used to obtain the local features of the disease region to help the CNN see better. The patch embedding method is improved to retain more edge information of the image and to promote information exchange between patches in the Transformer. The parameters and FLOPs (floating-point operations) of the model are significantly reduced by using depthwise separable convolution and linear-complexity multi-head attention. Experimental results on a self-built apple leaf disease dataset with complex backgrounds show that ConvViT achieves an identification result (96.85%) comparable to that of the state-of-the-art Swin-Tiny, while its parameters and FLOPs are only 32.7% and 21.7% of those of Swin-Tiny, respectively. ConvViT is also significantly ahead of MobileNetV3, EfficientNet-B0, and other models, which indicates that the proposed model is an effective disease-identification model with practical application value.
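As a rough illustration of why depthwise separable convolution reduces the parameter count, the following minimal sketch compares the weights of a standard convolution with those of its depthwise separable counterpart. The function names and the layer sizes (64 input channels, 128 output channels, 3x3 kernel) are assumptions chosen for illustration and are not taken from ConvViT; bias terms are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def reduction_ratio(c_in: int, c_out: int, k: int) -> float:
    """Separable / standard parameter ratio; algebraically 1/c_out + 1/k**2."""
    return depthwise_separable_params(c_in, c_out, k) / standard_conv_params(c_in, c_out, k)


# Hypothetical layer sizes, for illustration only (not from the paper).
C_IN, C_OUT, K = 64, 128, 3
STANDARD = standard_conv_params(C_IN, C_OUT, K)          # 73728
SEPARABLE = depthwise_separable_params(C_IN, C_OUT, K)   # 8768
RATIO = reduction_ratio(C_IN, C_OUT, K)                  # about 0.119
```

For this hypothetical layer the separable form needs roughly 12% of the weights of the standard convolution, and the FLOPs scale by the same factor when the output resolution is unchanged.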