Self-reduction multi-head attention module for defect recognition of power equipment in substation☆

doi:10.1016/j.gloei.2024.11.016

Yifeng Han*, Donglian Qi, Yunfeng Yan

(Zhejiang University, No. 38 Zheda Rd, Hangzhou 310007, Zhejiang, PR China. Received 14 July 2024; revised 7 October 2024; accepted 7 November 2024)

Keywords

Multi-Head attention; Defect recognition; Power equipment; Computational complexity

Abstract

Safety maintenance of power equipment is of great importance in power grids, in which image-processing-based defect recognition is supposed to classify abnormal conditions during daily inspection. However, owing to the blurred features of defect images, current defect recognition algorithms have poor fine-grained recognition ability. Visual attention can achieve fine-grained recognition through its ability to model long-range dependencies, but it introduces extra computational complexity, especially for multi-head attention in vision transformer structures. Under these circumstances, this paper proposes a self-reduction multi-head attention module that can reduce computational complexity and be easily combined with a Convolutional Neural Network (CNN). In this manner, local and global features can be calculated simultaneously in our proposed structure, aiming to improve the defect recognition performance. Specifically, the proposed self-reduction multi-head attention can reduce redundant parameters, thereby addressing the problem of limited computational resources. Experimental results were obtained based on the defect dataset collected from the substation. The results demonstrated the efficiency and superiority of the proposed method over other advanced algorithms. ©2025 Global Energy Interconnection Group Co. Ltd. Publishing services by Elsevier B.V. on behalf of KeAi Communications Co. Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

0 Introduction

Daily inspection of power equipment is one of the most important methods for ensuring the safe and stable operation of power systems. The timely and rapid detection of equipment defects is of great significance for ensuring the safety of power grids. However, the inspection process for carefully detecting defects is tedious and requires a large amount of manual labor [1]. Recently, intelligent defect recognition methods based on image processing have been extensively studied for use in the automatic inspection of power equipment. The deep learning-based image processing method is intended to automatically recognize existing defects and provide an alarm to the staff on duty.

In recent years, image recognition has made significant progress with both CNN and transformer structures in the field of artificial intelligence. AlexNet has demonstrated the significant ability of deep convolutional neural networks in image classification tasks [2]. VGGNet first evaluates a deep network using 3×3 convolution filters and then proves its efficiency [3]. ResNet uses a residual structure to provide better training results for deep neural networks [4]. ViT (Vision Transformer) introduces an attention structure into image recognition tasks and obtains state-of-the-art results compared with a convolutional neural network [5].

Thus, image-recognition technology has been widely applied in smart grids. Li et al. introduced face-recognition technology into the access system of a smart grid [6]. Cao et al. studied small-scale defect identification tasks based on image recognition technology [7]. Chen et al. applied the SIFT feature description operator to improve the image recognition algorithm of a smart grid [8]. Under these circumstances, intelligent inspection of power equipment has improved. However, unlike public datasets, a defect dataset of a power system has blurred and complex features. This results in more stringent requirements for intelligent image recognition algorithms.

The current intelligent substation inspection system consists of three parts, as illustrated in Fig. 1. First, the daily inspection of power equipment is performed using an inspection camera that automatically detects defects using a corresponding detection algorithm [9]. Once defects are detected, the inspection robot is sent to a specific location for a fine-grained recheck, and the detailed images are sent back to the central server. Subsequently, a specific recognition algorithm is used to classify the defects. The system provides an alarm once a defect is confirmed.

To improve the detection accuracy and reduce the false alarm rate, a recognition algorithm is used to judge defects in equipment images captured at close range. However, the defect images in the substation exhibit blurred features with variable shapes. Under these circumstances, the defect recognition model is expected to possess a more powerful ability for detailed understanding. Additionally, daily inspections accumulate a large number of images, which creates challenging requirements for the processing speed and computational complexity of this recognition model.

Fig. 1 Intelligent inspection system in substation.

For a more detailed understanding and higher recognition accuracy, researchers have introduced visual attention to the study of image recognition tasks. Unlike the CNN structure, the attention mechanism models long-range dependencies. In addition, a neural network model without an attention mechanism treats all input features in the image equally and cannot screen key features. This makes it more difficult for the network to learn the image data with blurred features. Therefore, a visual attention mechanism was designed to improve the performance of neural networks using plug-and-play modules. Squeeze and excitation networks (SEnet) [10], the convolutional block attention module (CBAM) [11], nonlocal networks [12], and implicit nonlocal networks [13] improve the accuracy of basic neural networks based on various attention structures. Subsequently, a vision transformer, mainly composed of multi-head attention modules, is used in image recognition tasks.

However, attention-based image recognition tasks suffer from two major problems. First, the attention mechanism introduces higher computational complexity into the original neural network. Second, a specific visual transformer requires large quantities of high-quality training data.

To obtain a defect detection algorithm with a stronger ability for detailed understanding, this study constructs a plug-and-play attention module and introduces a self-reduction mechanism [13] into multi-head attention. In this manner, a self-reduction multi-head attention module was designed. This module possesses the advantages of a transformer structure while addressing the problems of higher calculation costs and larger data requirements. Subsequently, the self-reduction multi-head attention module was combined with a neural network for defect recognition of power equipment in substations. Thus, this method can simultaneously model long-range dependencies through multi-head attention and local features through the CNN.

The main contributions of this paper are summarized as follows:

1) To improve the detailed understanding of the defect recognition method, a self-reduction multi-head attention module was designed that makes multi-head attention a plug-and-play module with fewer parameters and lower computational complexity.

2) The proposed method improves the accuracy of defect recognition tasks in the fine-grained rechecking of intelligent inspection systems. This is expected to reduce the labor costs of daily operations and maintenance in substations.

3) The proposed method combines multi-head attention with a CNN, thereby capturing both local and global relationships. With the specific design of the self-reduction multi-head attention, this method effectively utilizes the advantages of the transformer and addresses the training difficulty of the transformer structure.

1 Related work

1.1 Visual attention

As an important part of obtaining external information, the human visual system has a precise organizational structure and complex neural processing mechanisms. It first captures external images through the eyes and converts them into the corresponding neural signals. Faced with massive amounts of image information, the human visual system can quickly filter it and accurately transmit it to the human brain. This type of information filtering, which is the central psychological regulatory mechanism of human behavior, has attracted the attention of researchers.

In general, the human visual system quickly screens a large amount of external input visual information through a complex mechanism and preserves the most influential features. Inspired by this, researchers proposed attention mechanisms and applied them to the field of deep learning. The visual attention mechanism can extract data features more efficiently and selectively filter irrelevant information, which is significant for training deep networks. The visual attention mechanism models the relationships between input data and introduces relationship features into the network training process, thereby focusing the attention of the model on the characteristics of the data [14-16].

In image-processing tasks, researchers have constructed plug-and-play modules based on attention mechanisms. SEnet [10] reallocates the channel dimensions of the feature maps to change the impact of each single-channel feature map on the overall feature distribution. To achieve the interactive fusion of features between the channel and spatial attention, CBAM [11] cascades the proposed channel attention and spatial attention to construct a residual attention module. A nonlocal network [12], a typical representative of spatial attention models, is designed to model the relationship within each channel and assign attention weights to different pixels. The Mixed High-Order Attention mechanism [17] proposes high-order polynomials to predict complex high-order relationships among the input image features. The multi-head attention model [18] was applied by Google to machine translation tasks, completely abandoning convolutional and recurrent neural networks in favor of an attention mechanism. A comparison of different attention modules is presented in Table 1. Among these, multi-head attention exhibited the most efficient performance. As the core component of the transformer structure, it connects multiple second-order attention models in parallel, but with limited relationship calculations and high computational complexity.
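To make the "order 1" entry of Table 1 concrete, the following is a minimal pure-Python sketch of squeeze-and-excitation style channel attention: a per-channel average is turned into a weight that rescales the channel. The function name `se_channel_attention`, the flattened list layout, and the weight matrices `w1` and `w2` are illustrative assumptions, not code from [10].

```python
import math


def se_channel_attention(feature_map, w1, w2):
    """Rescale each channel by a weight computed from its global average.

    feature_map: list of c channels, each a flat list of h*w values (assumed layout).
    w1: hidden x c matrix, w2: c x hidden matrix (hypothetical excitation weights).
    Returns (rescaled feature map, per-channel weights).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(channel) / len(channel) for channel in feature_map]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    weights = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: every value of a channel is multiplied by that channel's weight.
    scaled = [[v * wt for v in channel] for channel, wt in zip(feature_map, weights)]
    return scaled, weights


if __name__ == "__main__":
    fmap = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]]  # 2 channels, 2x2 pixels
    scaled, weights = se_channel_attention(fmap, [[1.0, 1.0]], [[0.5], [-0.5]])
    print(weights)
    print(scaled)
```

Because the weight depends only on a per-channel statistic, no pixel-to-pixel affinity map is needed, which is why this family is cheaper than the second-order modules in the same table.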

1.2 Visual attention-based intelligent inspection of power equipment

Image-processing-based defect detection in power equipment encounters difficulties with complex image data backgrounds, few negative data points, and unclear sample features. Under these circumstances, researchers have introduced visual attention mechanisms into intelligent inspection tasks for power equipment. Zhao et al. [19] added an edge attention mechanism to a residual network and proposed an attention-based generation network to reconstruct the low-resolution infrared thermal images of power equipment. Lu et al. [20] proposed a text classification method for power equipment defects based on a multi-head attention recurrent convolutional neural network. Wu et al. [21] utilized attention models to achieve feature fusion during upsampling, which improved the accuracy of rust detection for power equipment. Wan [22] combined a convolutional attention module with a segmentation network and proposed a pointer instrument reading recognition algorithm. Liu et al. [23] proposed an electricity meter nameplate recognition algorithm that introduces an attention mechanism into the encoder-decoder structure to improve the nameplate recognition results. Zhang [24] combined the YOLO-v3 object detection algorithm with an attention mechanism and proposed a fire detection algorithm that enhanced the expression ability of feature semantics. He et al. [25] combined convolutional neural networks and channel attention mechanisms to increase the weight of local fault information and achieved improved recognition accuracy in the detection of lightning arresters, circuit breakers, current transformers, and voltage transformers.

Intelligent algorithms should be capable of fine-grained recognition to achieve higher accuracy of automatic inspection in power systems. Therefore, most algorithms apply an attention mechanism to this task, as illustrated in Table 2. Although the above researchers improved the ability of intelligent power equipment inspection, there are still problems, such as insufficient feature fusion, high computational complexity, and difficulty in calculating high-order models. Insufficient feature fusion indicates that the model simply applies a single type of attention, which introduces either channel or spatial attention into the network. In addition, the attention mechanism is considered to have higher computational complexity because of the affinity map calculation. Additionally, high-order attention can model a more detailed relationship but requires a more lightweight architecture.

Table 1 Comparison among several attention modules.

| Attention module | Relationship function | Order |
|---|---|---|
| SEnet | Feature in each channel as weight | 1 |
| Non-local network | Multiplication between two matrices | 2 |
| High-order attention | Multiplication among several matrices | 3+ |
| Cascading attention | Spatial attention and channel attention | 1 + 2 |
| Multi-head attention | Parallelization of several spatial attention | 2 |

Table 2 Comparison of intelligent inspection methods for power equipment.

| Method | Advantage | Shortcoming |
|---|---|---|
| [19] | Apply edge attention mechanism to the residual network | Insufficient feature fusion |
| [20] | Combine multi-head attention with recurrent network | Redundant parameters |
| [21] | Use attention models to achieve feature fusion during up-sampling | Insufficient feature fusion |
| [22] | Combine attention module with segmentation network | High computational complexity |
| [23] | Introduce attention mechanism into the encoder-decoder structure | High computational complexity |
| [24] | Insert attention module into YOLOv3 detection method | Difficult for high-order attention calculation |
| [25] | The application of channel attention | Insufficient feature fusion |

In contrast, this study first constructs a multi-head module with a convolutional feature to model high-order attention. The self-reduction mechanism was then combined with the proposed multi-head attention to reduce the redundant parameters and computational complexity. Finally, the self-reduction module was combined with a convolutional network to achieve fine-grained defect recognition in a substation.

1.3 Transformer and multi-head attention

In recent years, transformer structures have been widely used in natural language processing and computer vision. Unlike convolutional neural networks and recurrent neural networks, the transformer is composed entirely of attention mechanisms and incorporates the relationship features between semantic blocks into the context vector [18].

To investigate whether the attention mechanism can completely replace convolutional kernels in image feature extraction tasks, Ramachandran et al. directly used a self-attention module to replace the 3×3 convolutional kernels in ResNet and demonstrated that the replaced network performed better [26]. Dosovitskiy et al. applied a transformer to image-processing tasks and proposed a Vision Transformer model [27]. Carion et al. proposed a transformer-based object detection framework called DETR [28]. To improve the convergence speed and object-detection performance of the DETR model during training, Zhu et al. proposed a Deformable DETR [29]. The Pix2seq model considers object detection tasks in image processing as language processing tasks and assumes that the model knows the category and location of the object while teaching the model how to output information during training [30]. In image segmentation, a Segmentation Transformer combines ViT with a simple decoder to produce a powerful segmentation model [31]. TransUnet embeds a transformer model in a U-shaped network structure [32]. The CMT model utilizes both the transformer model's ability to extract distant features and the convolutional network's ability to extract local features [33]. To combine the advantages of convolutional neural networks and attention mechanisms, CoAtNet proposes a series of hybrid models of transformer attention mechanisms and a convolutional neural network to better balance the model parameter quantity and accuracy [34]. The Conformer model utilizes the coexistence structure of the transformer and convolutional neural network to simultaneously obtain local features and global dependencies and achieve the fusion of the two features [35]. The CvT model introduces convolutional operations into the ViT model, constructing a hierarchical transformer structure with new convolution token embeddings and a convolutional transformer block [36]. To solve the problem of high computational consumption in visual transformer models, the P2T model applies pyramid pooling to multi-head attention mechanisms while reducing the sequence length and capturing powerful contextual features [37]. Li et al. proposed a visual transformer with depthwise separable self-attention to solve the problem of the high computational burden in transformer models [38]. The above models combine the core structure of the transformer with convolution computation and exploit the strengths of both, resulting in more efficient models.

Although the above models outperform current algorithms in several image tasks, they still have certain limitations. First, networks with multi-head attention as the main structure suffer from problems such as high computational complexity and resource consumption during training. In addition, the excellent performance of the transformer in numerous tasks relies on a model pretrained on large public datasets, making its performance unstable on some uncommon datasets.

2 Method

This section first proposes the self-reduction multi-head attention module and then inserts it into ResNet [4] to complete the defect recognition task.

2.1 Self-reduction multi-head attention module

As the most important component of a transformer, multi-head attention can model long-range dependencies with several parallel blocks. In this section, the basic structure of a nonlocal network is used to design the parallel blocks, and the self-reduction operation [13] is extended to multi-head attention.

The training structure of the self-reduction multi-head attention is illustrated in Fig. 2. In contrast to the multi-head attention structure in ViT, each block in the proposed self-reduction multi-head attention first utilizes a convolutional operation to obtain feature vectors. Subsequently, by designing several parallel denominators, the affinity map calculation function can be automatically reduced after training. The input and output feature maps have the same size, which means that the self-reduction MHA module can be easily inserted at any stage of the deep neural network.
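The data flow described in this section (a 1×1 convolution, n parallel blocks, channel-wise concatenation, a second 1×1 convolution, and a residual addition) can be sketched in pure Python to show why the output keeps the size of the input. This is only an illustration under stated assumptions: the helper names `conv1x1`, `attention_block`, and `multi_head_module` are hypothetical, biases are omitted, and each block is a generic softmax-normalised non-local block. It does not implement the self-reduction denominators of the proposed module.

```python
import math


def conv1x1(pixels, weight):
    """1x1 convolution on a flattened map: one linear map applied per pixel.

    pixels: list of h*w pixels, each a list of c_in values.
    weight: c_out rows of c_in coefficients.
    """
    return [[sum(w * v for w, v in zip(row, px)) for row in weight] for px in pixels]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_block(pixels, w_t, w_p, w_g):
    """Generic second-order block (assumption): softmax(t p^T) g."""
    t = conv1x1(pixels, w_t)
    p = conv1x1(pixels, w_p)
    g = conv1x1(pixels, w_g)
    out = []
    for t_i in t:
        # Affinity of this pixel with every pixel of the feature map.
        affinity = softmax([sum(a * b for a, b in zip(t_i, p_j)) for p_j in p])
        out.append(
            [sum(a * g_j[k] for a, g_j in zip(affinity, g)) for k in range(len(g[0]))]
        )
    return out


def multi_head_module(pixels, w_in, heads, w_out):
    """Input conv -> n parallel blocks -> concat -> output conv -> residual."""
    x_prime = conv1x1(pixels, w_in)
    head_outputs = [attention_block(x_prime, *head) for head in heads]
    # Concatenate the block outputs along the channel axis, pixel by pixel.
    y_prime = [sum((h[i] for h in head_outputs), []) for i in range(len(pixels))]
    y = conv1x1(y_prime, w_out)
    # Residual addition keeps the module shape-preserving.
    return [[a + b for a, b in zip(y_px, x_px)] for y_px, x_px in zip(y, pixels)]


if __name__ == "__main__":
    x = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0], [2.0, 2.0]]  # 2x2 map, 2 channels
    identity = [[1.0, 0.0], [0.0, 1.0]]
    heads = [
        ([[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]),
        ([[0.0, 1.0]], [[1.0, 0.0]], [[1.0, -1.0]]),
    ]
    y = multi_head_module(x, identity, heads, [[0.1, 0.0], [0.0, 0.1]])
    print(len(y), len(y[0]))  # same pixel count and channel count as the input
```

Setting the output convolution to zero turns the module into an identity mapping, which is the usual reason a residual form makes such a block safe to insert into a pretrained backbone.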

Mathematically, the input feature map of the self-reduction multi-head attention module can be represented by $X \in \mathbb{R}^{c \times h \times w}$ and the output is $Y \in \mathbb{R}^{c \times h \times w}$, where $c$ is the number of channels; $h$ and $w$ are the height and width, respectively, of the feature map; and $n$ is the head number of the multi-head attention module, which indicates the number of single blocks that are completely parallel.

First, the input $X$ passes through the convolutional layer $f_{conv}$ with a 1×1 kernel to obtain the feature map $X' = f_{conv}(X)$.

Fig. 2 Structure of self-reduction multi-head attention.

Then, $X'$ is fed into the $n$ parallel attention blocks. In each block:

$X'$ is passed through three convolutional layers with a 1×1 kernel to obtain the three output features $t_i(x)$, $p_i(x)$, and $g_i(x)$, where $i \in [1, n]$. These three features are then extended into three two-dimensional feature maps, yielding $\theta_1(x)$, $\theta_2(x)$, and $g(x)$. Similar to the calculation process in [13], the proposed method first sets a self-reducing denominator for each block. The gradients of these denominators are not calculated during backpropagation. During inference, these denominators serve as constants.

The mathematical process can be illustrated as:

$M_i$ represents the output of each single block, and $value$ represents the denominator, which is supposed to achieve self-reduction during the inference process.

After that, these output tensors are concatenated to obtain the feature map $Y'$:

Then Y′ passes through a convolutional layer with kernel 1×1 and is added to the input X to construct the following residual form:

After training, the parameters $t_i(x)$ and $p_i(x)$ can be reduced automatically. The resulting structure is illustrated in Fig. 3. The inference procedure can be described mathematically:

First,

Then, the outputs are concatenated:

The final output is:

2.2 Intelligent defect recognition method

Current intelligent inspection systems in substations rely on high-quality algorithms with detailed understanding and feature extraction ability. In this study, a self-reduction MHA module was designed to improve the performance of the defect recognition module. Multi-head attention is the core component of a transformer and is considered the most powerful deep network in several computer vision tasks. However, transformer training requires a large amount of high-quality data. This makes its application to intelligent inspection tasks for smart grids difficult. In contrast, a convolutional neural network cannot model long-range dependency like a transformer but requires less training data. Under these circumstances, this study combined the MHA and convolutional neural network.

Fig. 3 Structure of self-reduction multi-head attention after training.

Fig. 4 ResNet50 with the proposed multi-head attention.

Specifically, this method inserts the proposed self-reduction multi-head attention module into ResNet50 [4]. The structure of the intelligent defect recognition method is illustrated in Figs. 4 and 5. Using the proposed multi-head attention, the network can model long-range dependency and introduce additional relationship information during the training process. Thus, the model can extract more detailed information from the training dataset.

3 Experiments

3.1 Training dataset

To evaluate the efficiency of the proposed method, we constructed a defect-recognition dataset. All images were collected from a power grid substation, which ensured the authenticity and validity of the training data. The training dataset comprises four categories: abnormal closure of the electricity box, blurred meter cover, silica gel discoloration, and insulator rupture. Examples of these training images are shown in Fig. 6.

The distribution of the defect dataset is listed in Table 3. The dataset consists of 3,772 images in total, comprising 3,192 training images and 580 testing images. All images are close-up shots of the defect details. Each image was resized to 608×608.
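As a quick consistency check, the per-class counts in Table 3 can be summed and compared with the totals quoted above. The dictionary below simply copies the table; the names `DEFECT_DATASET` and `split_totals` are illustrative.

```python
# (train, test) image counts per defect class, copied from Table 3.
DEFECT_DATASET = {
    "Blurred meter cover": (884, 160),
    "Silica gel discoloration": (924, 140),
    "Insulator rupture": (700, 140),
    "Abnormal closure of electricity box": (684, 140),
}


def split_totals(dataset):
    """Return the total number of training and testing images."""
    train = sum(n_train for n_train, _ in dataset.values())
    test = sum(n_test for _, n_test in dataset.values())
    return train, test


if __name__ == "__main__":
    train, test = split_totals(DEFECT_DATASET)
    share = 100.0 * test / (train + test)
    print(f"train={train}, test={test}, total={train + test}, test share={share:.1f}%")
```

The sums agree with the stated 3,192 training and 580 testing images, so roughly 15% of the data is held out for testing.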

3.2 Experiments on computational complexity

To evaluate the ability of our method to reduce computational complexity while improving defect recognition accuracy, this section first compares the FLOPs (floating-point operations) and parameter counts of the original multi-head attention and our proposed module. FLOPs were used to measure the overall complexity (required computational load). The head number of the multi-head attention was set to eight in this experiment.

In this experiment, the input feature maps of the multi-head attention module and our proposed module were set to 64×64 and 128×128, respectively. The experimental results are presented in Table 4, where 1 GFLOPs = 1×10⁹ FLOPs. The evaluation results show that our proposed method has lower computational complexity and fewer parameters than the original multi-head attention module.
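The size of the saving can be read off Table 4 with a few lines of arithmetic. The sketch below copies the reported values and computes the relative reduction; the names `TABLE_4`, `relative_reduction`, and `summarize` are illustrative and not part of the original experiments.

```python
# (input size, model) -> (GFLOPs, parameter count), copied from Table 4.
TABLE_4 = {
    ("64x64", "multi-head attention"): (0.47, 72_192),
    ("64x64", "ours"): (0.34, 68_096),
    ("128x128", "multi-head attention"): (1.89, 72_192),
    ("128x128", "ours"): (1.35, 68_096),
}


def relative_reduction(baseline, proposed):
    """Reduction of `proposed` relative to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


def summarize(table):
    """Map each input size to (FLOPs reduction %, parameter reduction %)."""
    rows = {}
    for size in ("64x64", "128x128"):
        base_flops, base_params = table[(size, "multi-head attention")]
        our_flops, our_params = table[(size, "ours")]
        rows[size] = (
            relative_reduction(base_flops, our_flops),
            relative_reduction(base_params, our_params),
        )
    return rows


if __name__ == "__main__":
    for size, (flops_cut, params_cut) in summarize(TABLE_4).items():
        print(f"{size}: FLOPs -{flops_cut:.1f}%, parameters -{params_cut:.1f}%")
```

On these numbers the FLOPs drop by a little under 30% at both input sizes, while the parameter count drops by about 6%, so most of the saving comes from computation rather than from model size.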

Fig. 5 The structural diagram of the proposed recognition method.

3.3 Experiments on defect recognition accuracy

To verify the effectiveness of the proposed multi-head attention and defect recognition method, the network was trained using the proposed defect dataset. As illustrated in Fig. 4, self-reduction multi-head attention is combined with ResNet50 in stage 4-2. To evaluate the efficiency of the proposed method, this experiment replaces the proposed module with SEnet, a nonlocal network, CBAM, and multi-head attention from the transformer structure. SEnet applies channel attention to the network, whereas nonlocal networks use spatial attention. CBAM connects the spatial and channel attention successively. In addition, the proposed method was compared with two other methods that focus on optimizing the self-attention mechanism. The criss-cross network (CCnet) [39] applies criss-cross attention to reduce computational complexity, and the global context network (GCnet) [40] simplifies the non-local network with a simplified calculation of the affinity map.

The training was conducted using a GeForce RTX 3090. The batch size was set to 64, and the learning rate was set to 1×10⁻³. The training process lasted 100 epochs. As the main evaluation metric for image recognition tasks, TOP1 represents the accuracy of the recognition results when the network outputs the result with the highest probability as the final answer.
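The TOP1 metric described above can be written down in a few lines. This is a minimal sketch with a hypothetical helper name `top1_accuracy`; the probabilities in the usage example are made up for illustration and are not results from the paper.

```python
def top1_accuracy(probabilities, labels):
    """Fraction of samples whose highest-probability class equals the label.

    probabilities: one list of class probabilities per sample.
    labels: the ground-truth class index of each sample.
    """
    if len(probabilities) != len(labels) or not labels:
        raise ValueError("need one label per sample and at least one sample")
    correct = 0
    for probs, label in zip(probabilities, labels):
        # The prediction is the index of the largest probability.
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


if __name__ == "__main__":
    # Three samples over the four defect classes (illustrative values only).
    outputs = [
        [0.70, 0.10, 0.10, 0.10],
        [0.20, 0.50, 0.20, 0.10],
        [0.10, 0.20, 0.30, 0.40],
    ]
    print(top1_accuracy(outputs, [0, 1, 2]))
```

The tables report this fraction as a percentage.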

According to the experimental results in Table 5, the proposed defect recognition method outperformed other algorithms with various attention modules. This is because the proposed method can simultaneously model long-range dependencies through multi-head attention and local features through the CNN. By comparison, CCnet and GCnet both remove unimportant branches during training, leading to missing information. In addition, compared with other simplified transformer-based methods, our proposed method focuses on designing a plug-and-play multi-head attention module instead of directly using the transformer structure. Thus, our method can improve the detailed understanding of the original model with a lower computational complexity. In addition, our proposed method relies on less training data, which makes it more efficient than transformer-based methods.

Fig. 6 Examples of defect images in the proposed dataset.

Table 3 Distribution of defect dataset in substation.

| Defect in substation | Train | Test |
|---|---|---|
| Blurred meter cover | 884 | 160 |
| Silica gel discoloration | 924 | 140 |
| Insulator rupture | 700 | 140 |
| Abnormal closure of electricity box | 684 | 140 |
| Total | 3,192 | 580 |

Table 4 Computational complexity comparison.

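| Input | Model | Self reduction | FLOPs | Parameters |
|---|---|---|---|---|
| 64×64 | Multi-head attention | × | 0.47G | 72,192 |
| 64×64 | Ours | ✓ | 0.34G | 68,096 |
| 128×128 | Multi-head attention | × | 1.89G | 72,192 |
| 128×128 | Ours | ✓ | 1.35G | 68,096 |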

3.4 Ablation study

To verify the reproducibility of our proposed method on other datasets, we evaluated different models on the CUB [41] and Cifar100 [42] datasets. The CUB dataset consists of 11,788 images of 200 types of birds, in which the training set contains 5,994 images and the testing set contains 5,794 images. Cifar100 consists of 100 classes, each containing 600 images, of which 500 are training images and 100 are testing images. The training was conducted using a GeForce RTX 3090. The batch size was set to 64, and the learning rate was set to 1×10⁻³. The training process lasted 110 epochs.

Table 6 presents the evaluation results for the different attention modules on the CUB dataset. Table 7 presents the test results for the Cifar100 dataset.

Table 5 Defect Detection Results of the proposed method.

| Model | Attention | TOP1 (%) |
|---|---|---|
| ResNet50 | × | 81.38 |
| ResNet50 + SEnet | Channel attention | 84.50 |
| ResNet50 + non-local network | Spatial attention | 85.58 |
| ResNet50 + CBAM | Cascade attention | 82.76 |
| ResNet50 + multi-head attention | Multi-head attention | 88.26 |
| GCnet | Cascade attention | 86.12 |
| CCnet | Criss-cross attention | 85.42 |
| Ours | Self-reduction multi-head attention | 90.12 |

Table 6 Testing results of attention modules on CUB dataset.

| Model | Attention | TOP1 (%) |
|---|---|---|
| ResNet50 | × | 85.87 |
| ResNet50 + SEnet | Channel attention | 85.95 |
| ResNet50 + non-local network | Spatial attention | 85.97 |
| ResNet50 + multi-head attention | Multi-head attention | 86.35 |
| Ours | Self-reduction multi-head attention | 86.75 |

Table 7 Testing results of attention modules on Cifar100 dataset (without pretrained model).

| Model | Attention | TOP1 (%) |
|---|---|---|
| ResNet50 | × | 73.24 |
| ResNet50 + SEnet | Channel attention | 73.41 |
| ResNet50 + non-local network | Spatial attention | 73.53 |
| ResNet50 + multi-head attention | Multi-head attention | 74.27 |
| Ours | Self-reduction multi-head attention | 74.87 |

The experimental results demonstrate the efficiency and reproducibility of the proposed self-reduction multi-head attention module on datasets other than the proposed defect recognition dataset for the power grid.
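To put the three result tables side by side, the following sketch computes the gain of the proposed module in percentage points, both over plain ResNet50 and over ResNet50 with standard multi-head attention. The values are copied from Tables 5, 6, and 7; the names `TOP1` and `gains` are illustrative.

```python
# TOP1 accuracy (%) copied from Tables 5, 6 and 7.
TOP1 = {
    "Substation defects": {"ResNet50": 81.38, "multi-head attention": 88.26, "ours": 90.12},
    "CUB": {"ResNet50": 85.87, "multi-head attention": 86.35, "ours": 86.75},
    "Cifar100": {"ResNet50": 73.24, "multi-head attention": 74.27, "ours": 74.87},
}


def gains(scores):
    """Percentage-point gain of the proposed module over both references."""
    over_backbone = round(scores["ours"] - scores["ResNet50"], 2)
    over_mha = round(scores["ours"] - scores["multi-head attention"], 2)
    return over_backbone, over_mha


if __name__ == "__main__":
    for dataset, scores in TOP1.items():
        over_backbone, over_mha = gains(scores)
        print(f"{dataset}: +{over_backbone} over ResNet50, +{over_mha} over multi-head attention")
```

The gain is largest on the substation defect dataset and below one percentage point over standard multi-head attention on the two public datasets, which is worth keeping in mind when reading the comparison.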

3.5 Limitation analysis

In this section, the limitations of our research are analyzed based on the data and modality levels:

Data level: This study focused on fine-grained defect recognition during daily inspections in a substation. However, the training dataset was collected during the daytime. The infrared images collected at night were excluded from the training dataset. Our method can be extended in the future to be simultaneously applicable to both visible (daytime) and infrared (nighttime) images.

Modality level: Multi-modal data processing can achieve information complementarity, which is beneficial for fine-grained defect recognition. These modalities include images, physical signals (current and voltage), and historical textual records. The proposed method does not consider processing methods for multi-modal inputs, which can be further improved in the future.

4 Conclusion

Fine-grained recognition is of great significance for intelligent inspection tasks in smart grids. Timely warnings from intelligent defect recognition could help ensure the safe and stable operation of the power grid. However, the current algorithms encounter several problems. First, the current attention-based methods use either channel or spatial attention, resulting in insufficient feature fusion. Second, the application of the attention mechanism introduces redundant parameters and requires more computational resources, making it difficult to apply to edge devices. In addition, current algorithms cannot model high-order attention because of the lack of a simplified attention structure.

Under these circumstances, this study first proposes a self-reduction multi-head attention module that can reduce the computational complexity while improving the recognition accuracy. Specifically, each component of the module can achieve global feature fusion from each pixel point. Multi-head attention can easily model high-order dependencies compared with channel or spatial attention. The proposed module was combined with the classical convolutional neural network ResNet50 and trained to achieve fine-grained defect recognition in a substation. The self-reduction design reduces the redundant parameters and decreases the computational complexity. The experimental results demonstrate that the proposed method has lower complexity and outperforms other advanced methods.

Daily inspection in a substation locates an abnormal area using fixed cameras and then determines the specific defect category based on fine-grained recognition algorithms. The capability of fine-grained recognition algorithms determines the level of intelligence of the automatic inspection process. Additionally, fine-grained defect recognition algorithms for power equipment are typically installed on drones or inspection robots, which have limited computing resources and therefore require low computational complexity. The algorithm designed in this study improves the ability to provide a detailed understanding while reducing computational complexity. This enables the model to achieve fine-grained defect classification more efficiently and has practical significance for the construction of smart grids.

To improve the fine-grained recognition ability of the algorithm in smart grids, this study combines the self-reduction method with a multi-head attention mechanism. However, the method focuses on only a single modality captured during the day and does not consider multi-modality fusion. Because multi-modality information is more beneficial for fine-grained classification, the proposed attention module can be further explored for cross-modal information processing, for example by extending it to simultaneous visible and infrared imaging.

CRediT authorship contribution statement

Yifeng Han: Writing - original draft, Methodology, Investigation, Conceptualization. Donglian Qi: Validation, Supervision. Yunfeng Yan: Writing - review & editing, Formal analysis.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the Major Program of the National Natural Science Foundation of China under Grant 62127803.

References

[1]
R.H.Jiao,Y.Z.Liu,H.He,et al.,A deep learning model for smallsize defective components detection in power transmission tower,IEEE Trans.Power Delivery 37 (4) (2022) 2551-2561. [百度学术]
[2]
A.Krizhevsky, I.Sutskever, G.E.Hinton, ImageNet classification with deep convolutional neural networks, Commun.ACM 60 (6)(2017) 84-90. [百度学术]
[3]
K.Simonyan,A.Zisserman,Very deep convolutional networks for large-scale image recognition, arXiv: 1409.1556, 2014. [百度学术]
[4]
K.M.He, X.Y.Zhang, S.Q.Ren, et al., 2016.Deep residual learning for image recognition, in: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition(CVPR),2016 in Las Vegas, USA, pp: 770-778. [百度学术]
[5]
A.Dosovitskiy,L.Beyer,A.Kolesnikov,et al.,An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint: 2010.11929, 2020. [百度学术]
[6]
Z.P.Li, J.Wang, S.Shen, et al., Design of smart grid network access system based on face recognition, in: Proceedings of IEEE 4th International Conference on Power,Electronics and Computer Applications(ICPECA),2024 in Shenyang,China,2024,pp.1238-1242. [百度学术]
[7]
G.Cao, Y.Liu, Z.Fan, et al., Research on small-scale defect identification and detection of smart grid transmission lines based on image recognition, in: Proceedings of IEEE 4th International Conference on Automation,Electronics and Electrical Engineering(AUTEEE), 2021 in Shenyang, China, 2021, pp.423-427. [百度学术]
[8]
L.Chen, W.Yi, L.H.Zhang, et al., Smart grid image recognition based on neural network and SIFT algorithm, in: Proceedings of International Conference on Networking, Informatics and Computing (ICNETIC), 2023 in Palermo, Italy, 2023, pp.1-5. [百度学术]
[9]
C.Huang, M.H.Chen, L.Wang, Semi-supervised surface defect detection of wind turbine blades with YOLOv4, Global Energy Interconnect.7 (3) (2024) 284-292. [百度学术]
[10]
J.Hu, L.Shen, G.Sun, Squeeze-and-excitation networks, in:Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018 in Salt Lake City, 2018, pp.7132-7141. [百度学术]
[11]
S.Woo, J.Park, J.Y.Lee, et al., CBAM: convolutional block attention module,in:Lecture Notes in Computer Science,Springer International Publishing, 2018, pp.3-19. [百度学术]
[12]
X.L.Wang, R.Girshick, A.Gupta, et al., Non-local neural networks,in: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition.Salt Lake City,2018 in UT,USA,2018, pp.7794-7803. [百度学术]
[13]
Y.F.Han, X.Chen, S.J.Zhang, et al., iNL: implicit non-local network, Neurocomputing 482 (2022) 50-59. [百度学术]
[14]
R.B.Li, K.Xian, C.H.Shen, et al., Deep attention-based classification network for robust depth prediction, in: Lecture Notes in Computer Science, Springer International Publishing,2019, pp.663-678. [百度学术]
[15]
N.Sarafianos, X.Xu, I.A.Kakadiaris, Deep imbalanced attribute classification using visual attention aggregation, in: Lecture Notes in Computer Science, Springer International Publishing, 2018, pp.708-725. [百度学术]
[16]
H.W.Ge, Z.H.Yan, W.H.Yu, et al., An attention mechanism based convolutional LSTM network for video action recognition,Multimed.Tools Appl.78 (14) (2019) 20533-20556. [百度学术]
[17]
B.H.Chen, W.H.Deng, J.N.Hu, Mixed high-order attention network for person re-identification,in:Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), 2019 in Seoul, Republic of Korea, 2019, pp.371-381. [百度学术]
[18]
A.Vaswani, N.Shazeer, N.Parmarh, et al., Attention is all you need, Adv.Neural Inf.Process.Syst.(2017). [百度学术]
[19]
H.S.Zhao, Y.H.Peng, B.C.Liu, et al., (2022) Super-resolution reconstruction of electric equipment’s thermal imaging based on generative adversarial network with edge-attention,Proc.CSEE 42(10) (2022) 3564-3573. [百度学术]
[20]
S.H.Lu, A classification model of power equipment defect record texts based on multi-head attention RCNN network, GuangXi University, 2021. [百度学术]
[21]
Z.H.Wu,W.H.Xiong,J.F.Ren,et al.,Corrosion object detection of power equipment based on lightweight SSD, Comput.Syst.Appl.29 (2) (2020) 262-267. [百度学术]
[22]
J.L.Wan,Research and application of image recognition algorithm for power equipment based on deep learning, Zhejiang University,2021. [百度学术]
[23]
Y.Liu,Z.B.Zhang,W.Zhang,et al.,Nameplate identification for power devices based on visual attention model,Chinese J.Electron Devices 45 (3) (2022) 623-627. [百度学术]
[24]
X.B.Zhang, Development of fault recognition software based on multi-source image for electrical equipment and environmental monitoring, Southeast University, 2021. [百度学术]
[25]
Q.Y.He,D.T.Pan,G.L.Li,et al.,A power equipment monitoring system based on SE-Attention image recognition model, J.Xi’an Polytechnic Univ.35 (4) (2021) 71-76. [百度学术]
[26]
P.Ramachandran,N.Parmar,A.Vaswani,et al.,Stand-alone selfattention in vision models, in: NeurIPS, 2019, pp.68-80. [百度学术]
[27]
A.Dosovitskiy,L.Beyer,A.Kolesnikov,et al.,An image is worth 16x16 words: Transformers for image recognition at scale, in:ICLR, 2020. [百度学术]
[28]
H.G.Suriyage, H.Rathnayake, End-to-end object detection with transformers:Supplementarymaterial,in:ECCV,2020,pp.213-229. [百度学术]
[29]
X.Z.Zhu, W.J.Su, L.W.Lu, et al., Deformable DETR:Deformable transformers for end-to-end object detection, in:ICLR, 2020. [百度学术]
[30]
T.Chen, S.Saxena, L.Li, et al., Pix2seq: A language modeling framework for object detection.arXiv: 2109.10852, 2021. [百度学术]
[31]
S.X.Zheng, J.C.Lu, H.S.Zhao, et al., Rethinking semantic segmentation from a sequence-to-sequence perspective with transformer, in: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021 in Nashville, USA, 2021, pp: 6881-6890. [百度学术]
[32]
J.N.Chen,Y.Y.Lu,Q.H.Yu,et al.,TransUNet:transformers make strong encoders for medical image segmentation:2102.04306,2021. [百度学术]
[33]
J.Y.Guo, K.Han, H.Wu, et al., CMT: convolutional neural networks meet vision transformers, in: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR),2022 in New Orleans, USA, 2022, pp.12175-12185. [百度学术]
[34]
Z.Dai,H.Liu,Q.V.Le,et al.,Coatnet:Marrying convolution and attention for all data sizes,Adv.Neural Inf.Proces.Syst.34(2021)3965-3977. [百度学术]
[35]
Z.L.Peng, W.Huang, S.Z.Gu, et al., Conformer: Local features coupling global representations for visual recognition, in: IEEE/CVF International Conference on Computer Vision (ICCV), 2021 in Montreal, Canada, 2021, pp.367-376. [百度学术]
[36]
H.P.Wu,B.Xiao,N.Codella et al.,CvT:introducing convolutions to vision transformers,in:Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), 2021 in Montreal,Canada, 2021, pp.22-31. [百度学术]
[37]
Y.H.Wu, Y.Liu, X.Zhan, et al., P2T: pyramid pooling transformer for scene understanding, IEEE Trans.Pattern Anal.Mach.Intell.45 (11) (2023) 12760-12771. [百度学术]
[38]
W.Li, X.Wang, X.Xia, et al., SepViT: Separable vision transformer: 2203.15380, 2022. [百度学术]
[39]
Z.L.Huang, X.G.Wang, L.C.Huang, et al., CCNet: Criss-cross attention for semantic segmentation,in:Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), 2019 in Seoul, Republic of Korea, 2019, pp.603-612. [百度学术]
[40]
Y.Cao, J.R.Xu, S.Lin, et al., GCNet: Non-local networks meet squeeze-excitation networks and beyond,in:Proceedings of IEEE/CVF International Conference on Computer Vision Workshop(ICCVW), 2019 in Seoul, Republic of Korea, 2019. [百度学术]
[41]
C.Wah, S.Branson, P.Welinder, et al., The caltech-ucsd birds-200-2011 dataset, 2011. [百度学术]
[42]
A.Krizhevsky, G.Hinton, Learning multiple layers of features from tiny images, 2009. [百度学术]

Author

Yifeng Han

Yifeng Han received the B.E. degree in automation from Zhejiang University in 2017, the M.E. degree in control systems from Imperial College London in 2018, and the Ph.D. degree from Zhejiang University in 2024. His current research interests cover image processing and artificial neural networks.

Donglian Qi

Donglian Qi received the Ph.D. degree from the School of Electrical Engineering, Zhejiang University, China, in 2002. She is currently a professor and a Ph.D. advisor with Zhejiang University. Her current research interests cover intelligent information processing, chaotic systems, and nonlinear theory and application.

Yunfeng Yan

Yunfeng Yan received the Ph.D. degree from the School of Electrical Engineering, Zhejiang University, China, in 2019. She is currently an associate professor at Zhejiang University. Her current research interests cover intelligent power systems and image processing technology.

Publish Info

Received: 2024-07-14

Accepted: 2024-11-07

Published: 2025-02-25

Reference: Yifeng Han, Donglian Qi, Yunfeng Yan (2025). Self-reduction multi-head attention module for defect recognition of power equipment in substation. Global Energy Interconnection, 8(1): 82-91.

(Editor Zedong Zhang)
