      Global Energy Interconnection

      Volume 4, Issue 6, Dec 2021, Pages 576-586

      Fair hierarchical clustering of substations based on Gini coefficient

Dajun Si1, Wenyue Hu2, Zilin Deng2, Yanhui Xu2
(1. Yunnan Electric Power Grid Ltd, Kunming 650011, P.R. China; 2. North China Electric Power University, Beijing 102206, P.R. China)

      Abstract

For the load modeling of a large power grid, the large number of substations covered by it must be segregated into several categories and, thereafter, a load model built for each category. To address the problem of the skewed clustering tree in the classical hierarchical clustering method used for categorizing substations, a fair hierarchical clustering method is proposed in this paper. First, a fairness index is defined based on the Gini coefficient. Thereafter, a hierarchical clustering method is proposed based on the fairness index. Finally, the clustering results are evaluated using the contour coefficient and a t-SNE two-dimensional plane map. The substation clustering example of a real large power grid considered in this paper illustrates that the proposed fair hierarchical clustering method can effectively address the problem of the skewed clustering tree with high accuracy.

      0 Introduction

Decisions in power system planning and operation depend on time-domain simulation, and the accuracy of the simulation results depends heavily on the accuracy of the model employed for the simulation. Because the load is complex, distributed, and time-varying, it is very challenging to build an accurate load model. The primary methods of load modeling are the component-based and measurement-based methods [1]. A power grid consists of a large number of substations; in the above-mentioned methods, these are divided into several categories, and a load model is built for each category to form the load model parameter database for the entire power grid. Based on statistical data, substations with similar load composition are classified into one category through cluster analysis. Such a load model can maximize the generalization ability [2].

The clustering of the substations involves two tasks: determining the feature vectors for clustering and selecting an appropriate clustering method. The feature vector of a sample represents the selection of sample information from the common features of the samples to be clustered; the selection of concise and reasonable features can improve the clustering efficiency and ensure accuracy. If inappropriate sample features are used, the pattern relationship between the different samples will not be clear, and the clustering results will lose accuracy and rationality. Ideal feature selection not only serves as a basis for distinguishing among the different samples but also has strong resistance to noise. However, the selection of feature vectors must be analyzed case by case according to the specific requirements, and there is no universally correct and reasonable selection method. Research indicates that the feature vectors selected for load classification in a power system may include, but are not limited to, the following:

(1) Comprehensive initial condition factors: The feature vector may include several influencing factors such as the initial value of voltage, voltage change, initial value of active power, initial value of reactive power, season [3], date, time [4] [5], and climatic parameters. However, this method increases the statistical workload, and the correlation with the load composition is complex. Thus, it is difficult to ensure that the loads can factor in all the abovementioned factors. Therefore, this method is generally used as an auxiliary analysis.

(2) Measured load response data: The feature vectors are the relevant parameters obtained from the measured responses to different load disturbances [6]. This method is suitable for most cases, but it is limited by the actual operating conditions of the power grid. Disturbances in the power grid are random, and their time, intensity, and voltage amplitude may not be suitable as feature vectors. Therefore, to a certain degree, this method is practically challenging.

(3) Measured model response data: The feature vectors consist of the simulation model response data under a standard voltage excitation [7]. All types of load disturbances are modeled under the application of a standard voltage excitation, and the active and reactive power responses of the load are used as feature vectors. This method is effective, but certain algorithms require a long time and a large workload, and imprecise models can easily cause errors.

(4) Load component composition data: The feature vectors consist of the load component composition data; that is, statistics are compiled on the composition of the main electrified equipment in the substation, and the proportions of the various load components in the substation are used as the feature vectors. This method is effective and easy to implement.

In this paper, the proportions of the various industries in the substations are used as the clustering feature vectors, as illustrated in the sketch below. Compared with the other methods, data on the proportions of the constituent industries can be obtained more easily in practice, and the model can be modified as the load varies. This method has strong applicability in engineering.
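As a concrete (hypothetical) illustration of such a feature vector, consider a substation whose load is split across the five categories used later in Section 4; the numbers below are invented for illustration:

import numpy as np

# Hypothetical feature vector for one substation: proportions of
# [industrial, agricultural, commercial, residential, other] load.
substation = np.array([0.62, 0.05, 0.10, 0.20, 0.03])
assert abs(substation.sum() - 1.0) < 1e-9  # proportions sum to 1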

The clustering method refers to the algorithm used for clustering. The accuracy of the model and the speed of calculation can be improved using an appropriate algorithm. Four clustering methods are introduced in reference [8]; notably, hierarchical clustering does not require the number of clusters to be determined in advance, unlike other clustering methods. Three hierarchical clustering algorithms are introduced in reference [9]. Compared with other clustering methods, the hierarchical clustering method not only is suitable for data sets with arbitrary attributes and shapes but also can flexibly control the clustering granularity at different levels. Therefore, it has strong clustering ability. The research status of clustering technology is reviewed and the principles and characteristics of the various clustering methods are compared in reference [10].

Hierarchical clustering is of significant value in power systems. Its principle is relatively simple, and it is highly applicable to data; hence, it can easily be used in the power system field, where data complexity is high. This method has been adopted in power system load forecasting [11-13], new energy generation forecasting [14], dynamic frequency prediction after disturbances [15], energy storage system optimization [16], equipment fault diagnosis [17], the clustering of load substations, and other fields, which indicates that it has broad applicability in power systems.

The commonly used methods for clustering load substations are the hierarchical clustering algorithm [18] [19], the partition clustering algorithm [20], the density-based clustering algorithm [21], and so on. Reference [22] improves the application of the hierarchical clustering algorithm to load clustering and verifies that better clustering results can be obtained in a shorter time. Reference [23] applies hierarchical clustering to load clustering, points out some shortcomings of this method, and combines it with other clustering methods. The hierarchical clustering method has significant application prospects in the clustering of load substations. However, further research is required to improve the accuracy of clustering.

In each clustering step, clusters containing more samples often have a larger probability of participating in the merge. Therefore, the larger clusters grow larger and larger, whereas the samples that have not yet participated in clustering are absorbed into a large cluster one by one. The final result is a skewed hierarchical clustering tree.

In this study, the proportions of various industries in the substations are used as the clustering feature vector, the classical hierarchical clustering method is improved, and a fair hierarchical clustering method based on the Gini coefficient is proposed. The example of an actual large power grid illustrates that the proposed method can improve the accuracy of clustering.

      1 Gini coefficient

The Gini coefficient is an index proposed by the Italian economist Gini at the beginning of the 20th century to measure the degree of equality of a distribution based on the Lorenz curve [24]. As depicted in Fig.1, the horizontal axis represents the members of the entire society ordered from low- to high-income groups. The vertical axis represents the cumulative income share of the entire society.

Fig.1 Lorenz curve

The straight line OM represents the absolute equality line; that is, on this line, as the cumulative share of the members of society increases by 10% from point O, their cumulative income share also increases by 10% of the wealth of the entire society. Thus, OM denotes absolute equality across the entire society. The broken line OPM is the absolute inequality line, which indicates that the entire income of society is held by one person. The curve between the straight line OM and the broken line OPM is termed the Lorenz curve. The closer the Lorenz curve is to the straight line OM, the more equal the distribution; the closer it is to the broken line OPM, the more unequal the distribution.

To better describe the degree of equality, Gini defined the ratio of the area A between the straight line OM and the Lorenz curve to the total area between the straight line OM and the broken line OPM as the Gini coefficient, that is, G = A/(A + B). When A = 0, the Lorenz curve coincides with the absolute equality line, indicating an absolutely equal income distribution; when B = 0, it coincides with the absolute inequality line, indicating a maximally unequal income distribution. Because the value of the Gini coefficient lies within the range 0-1, a value closer to 0 indicates a more equal income distribution and, conversely, a value closer to 1 indicates a more unequal income distribution.

The income is converted into the number of samples in a cluster to calculate the Gini coefficient. Let ci represent the number of samples contained in the ith cluster and n represent the number of clusters participating in this clustering step. The clusters are sorted by the number of samples they contain, so that c1 is the smallest and cn the largest. In Fig.1, based on the definition of the Gini coefficient [25], the area of the triangle is A + B = 0.5. Therefore, G = A/(A + B) = 1 − 2B.

The n sorted clusters divide B into a triangle and n − 1 right-angled trapezoids, and the curved area under the Lorenz curve is approximated by the sum of these pieces: B ≈ (1/n) Σ_{i=1}^{n} (S_{i−1} + S_i)/2, where S_i = (c1 + … + ci)/(c1 + … + cn) is the cumulative sample share and S_0 = 0.

      The standardized Gini coefficient is expressed as follows:

where C(k) represents the set of all clusters participating in the kth clustering step.
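As a concrete illustration of the trapezoid approximation above, the following sketch computes the (unstandardized) Gini coefficient directly from a list of cluster sizes; the function and variable names are ours, not the paper's, and the standardization step of (10) is not shown:

import numpy as np

def gini(cluster_sizes):
    """Gini coefficient of cluster sizes via the trapezoid
    approximation of the area B under the Lorenz curve."""
    c = np.sort(np.asarray(cluster_sizes, dtype=float))  # ascending: c1 ... cn
    n = len(c)
    s = np.concatenate(([0.0], np.cumsum(c) / c.sum()))  # S_0 = 0, S_i cumulative share
    B = np.sum((s[:-1] + s[1:]) / 2) / n                 # sum of trapezoid areas
    return 1.0 - 2.0 * B                                 # G = A/(A+B), with A+B = 0.5

print(gini([10, 10, 10, 10]))  # 0.0   (perfectly equal cluster sizes)
print(gini([1, 1, 1, 97]))     # ~0.72 (one dominant cluster, skewed)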

      2 Fair hierarchical clustering algorithm

      2.1 Classical hierarchical clustering algorithm

The classical hierarchical clustering algorithm is a clustering method based on the association rules between data, which uses a hierarchical structure to split or aggregate samples to obtain the final solution [26]. Its characteristic is that it does not require the number of classes to be set in advance, and it can present the clustering process and results by means of a hierarchical tree graph. Based on the direction of clustering, hierarchical clustering can be divided into top-down and bottom-up clustering. In top-down hierarchical clustering, all the samples are first classified into one cluster; in each successive step, each cluster is gradually divided into two smaller clusters until each cluster contains a single sample or certain conditions are met, at which point the algorithm ends. In bottom-up hierarchical clustering, each sample is first treated as its own cluster; by comparing the distances between different clusters, the two closest clusters are merged step by step to form larger clusters until all samples are in one cluster or the algorithm reaches its end condition.

In this study, the bottom-up (agglomerative) strategy is adopted. The steps for the hierarchical clustering of n samples are as follows:

(1) X = {x1, x2, …, xn} is the sample set. Initially, each sample forms its own cluster, Ci = {xi}, giving the partition C = {C1, C2, …, Cn}, which must satisfy C1 ∪ C2 ∪ … ∪ Cn = X, that is, the union of all Ci is X.

(2) To calculate the distance between samples, a fixed measurement strategy such as the Euclidean or Mahalanobis distance may be used. In this study, the Euclidean distance is used:

d(x1, x2) = sqrt( Σ_{i=1}^{p} (x1i − x2i)² )

where x1 and x2 represent samples 1 and 2, x1i and x2i represent the ith feature of samples 1 and 2, respectively, and p represents the number of features in each sample.

(3) To measure the distance between clusters, the commonly used metrics are the single-link, full-link, and average-link standards. Any one of these standards can be used to construct the nth-order distance measurement matrix D = [dij].

The element dij of the distance metric matrix D is the distance measure between the ith and jth clusters; the diagonal elements of D represent the distance between a cluster and itself and are therefore 0. A smaller distance dij indicates that the two clusters should be grouped together.

Once the distances are measured, the two nearest clusters are merged, and the dimension of the distance metric matrix D is reduced to n − 1. The algorithm returns to step (3) to re-measure the distances between clusters and merge the two closest clusters; this process is repeated until the end condition is reached.
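For reference, this classical bottom-up procedure is available in standard libraries; the following is a minimal sketch using SciPy, with placeholder sample data of our own:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Each row: one substation's feature vector, e.g., proportions of
# [industrial, agricultural, commercial, residential, other] load.
X = np.array([
    [0.7, 0.1, 0.1, 0.1, 0.0],
    [0.6, 0.2, 0.1, 0.1, 0.0],
    [0.1, 0.1, 0.2, 0.6, 0.0],
    [0.1, 0.0, 0.3, 0.5, 0.1],
])

# Bottom-up clustering with Euclidean distance and the single-link standard.
Z = linkage(X, method='single', metric='euclidean')

# Cut the hierarchy into 2 clusters after the fact (one advantage of
# hierarchical clustering noted in the Introduction).
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g., [1 1 2 2]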

      The classical hierarchical clustering flowchart is presented in Fig.2.

      Fig.2 Flowchart depicting the classical hierarchical clustering method

Because each cluster often contains multiple samples, the distance metric between a pair of clusters can be defined in different ways. As mentioned above, the three most commonly used metrics are the single-link, full-link, and average-link standards.

(1) The single-link standard uses the shortest distance between two samples from two different clusters as the metric distance between the clusters:

d(Ci, Cj) = min { d(x1, x2) : x1 ∈ Ci, x2 ∈ Cj }

where Ci represents the ith cluster, and x1 and x2 represent samples in clusters Ci and Cj, respectively. The single-link standard can easily lead to the merging of two clusters that should not be merged. For example, consider a case in which the two clusters are far apart as a whole but one sample in Ci is very close to one sample in Cj. This standard uses the distance between these two points as the distance metric between Ci and Cj, and the resulting clustering effect is not ideal.

(2) The full-link standard uses the largest distance between two samples belonging to two different clusters as the metric distance between the clusters:

d(Ci, Cj) = max { d(x1, x2) : x1 ∈ Ci, x2 ∈ Cj }

The full-link standard is the opposite of the single-link standard. When two clusters that are close to each other should be merged, the algorithm considers the distance between the two farthest samples as the inter-cluster distance, which can wrongly prevent the merging of the two clusters.

(3) The average-link standard measures the distances between all pairs of samples belonging to two different clusters and averages these values to obtain the metric distance between the two clusters:

d(Ci, Cj) = (1/(ci cj)) Σ_{x1 ∈ Ci} Σ_{x2 ∈ Cj} d(x1, x2)

where ci and cj are the numbers of samples in Ci and Cj.

The average-link standard thus takes the average of all pairwise sample distances between the two clusters as the distance between them. This method generally produces more appropriate results than the other two standards. The three standards are illustrated in the sketch below.
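The three standards differ only in how they reduce the matrix of pairwise sample distances; a small sketch (function and variable names are ours):

import numpy as np
from scipy.spatial.distance import cdist

def link_distances(Ci, Cj):
    """Single-, full-, and average-link distances between two clusters,
    each given as an array of sample row vectors."""
    D = cdist(Ci, Cj, metric='euclidean')  # all pairwise sample distances
    return D.min(), D.max(), D.mean()      # single-link, full-link, average-link

Ci = np.array([[0.0, 0.0], [0.1, 0.0]])
Cj = np.array([[1.0, 0.0], [5.0, 0.0]])
print(link_distances(Ci, Cj))  # ~(0.9, 5.0, 2.95)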

However, for hierarchical clustering, when the collected data are not well suited, a “snowball” problem may arise regardless of which link standard is adopted: once a cluster contains a large number of samples, the centroid of the cluster tree shifts toward the center of that cluster, giving it a greater probability of participating in the next clustering step than the other clusters. In contrast, clusters containing a single sample can only be absorbed by these large clusters as individuals, which eventually tilts the hierarchical tree. This problem makes the clustering results sensitive to sample selection: if different data are chosen for the same load, the clustering results will differ greatly, reducing their credibility.

      2.2 Fairness index

To make the two clusters participating in each clustering step as fair as possible and to reduce the impact of data selection on the clustering results, the concept of the Gini coefficient is introduced. The standardized Gini coefficient can be applied to hierarchical clustering to determine whether the numbers of samples in all the clusters participating in the step are equal, and the fairness index is defined as follows:

where the first quantity represents the standardized Gini coefficient of the kth clustering step and the second is the exc-p Gini coefficient, that is, the standardized Gini coefficient of all the participating clusters except the pth cluster. For the kth clustering step, the standardized Gini coefficient of all the clusters participating after excluding the pth cluster is expressed as follows:

Assuming that the pth cluster contains a large number of samples, then, all else being equal, the exc-p Gini coefficient of the pth cluster is the smallest, so that the fairness index of the pth cluster is the largest.

      Considering the single-link standard as an example, a new fair single-link standard can be obtained by introducing the fairness index, as follows:

The essence of hierarchical clustering is to repeatedly select the two nearest clusters for merging. When the pth cluster contains a large number of samples and the fair single-link inter-cluster distance metric is adopted, the measured distance between the pth cluster and the other clusters becomes larger, so that it is no longer dominant in the clustering process. This reduces the possibility of tilting the hierarchical tree to a certain extent.
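The exact forms of the fairness index and the fair single-link standard (11) are given in the typeset equations of the original paper, which are not reproduced in this text version. The sketch below therefore encodes only the behavior described above, under two explicit assumptions of ours: the fairness index is taken as the ratio of the overall Gini coefficient to the exc-p Gini coefficient, and it scales the single-link distance multiplicatively. It reuses gini() from the Section 1 sketch and omits the standardization of (10):

def fairness_index(sizes, p):
    """Assumed form: G(all clusters) / G(all clusters except p).
    Removing a large cluster makes the remainder more equal, so its
    exc-p Gini is the smallest and its fairness index the largest."""
    g_exc = gini(sizes[:p] + sizes[p + 1:])
    # An exc-p Gini of 0 would make the index unbounded; such a cluster
    # is simply never preferred for merging in this sketch.
    return gini(sizes) / g_exc if g_exc > 0 else float('inf')

def fair_single_link(d_single, sizes, p, threshold=0.4):
    """Assumed multiplicative form of the fair single-link standard:
    inflate cluster p's single-link distance by its fairness index
    whenever the Gini coefficient of the cluster sizes exceeds the
    threshold; otherwise fall back to the original standard."""
    if gini(sizes) <= threshold:
        return d_single
    return fairness_index(sizes, p) * d_single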

      2.3 Fair hierarchical clustering method

The fair hierarchical clustering method is obtained by applying the fairness index to the traditional hierarchical clustering method. The flowchart of the fair hierarchical clustering method is presented in Fig.3.

      The steps for clustering samples using the fair hierarchical clustering algorithm are as follows:

(1) Each sample is defined as a single cluster, and the Euclidean distances between the samples are calculated to generate the initial measurement matrix D using (4) and (5).

(2) The standardized Gini coefficient is calculated for all clusters participating in the clustering step, using (10), to determine whether it is greater than the set threshold.

(3) If the standardized Gini coefficient is less than the set threshold, the single-link standard of the original algorithm is adopted; if it is greater than the threshold, the single-link standard including the fairness index is adopted, and the metric distance between the clusters is calculated as in (6) and (11).

      Fig.3 Flowchart depicting the fair hierarchical clustering method

      (4) The two sets with the shortest measured distance are merged to form a new set, the algorithm returns to step (2), and this process is repeated until the clustering work is finished.

      (5) The result is output, and the algorithm ends.

In step (3), the threshold can be set according to the sample situation of the region. In the clustering of substations, if the industrial load in the area is heavy, the threshold can be set higher so that the industry-heavy substations can be grouped together, whereas if the distribution of industries in the area is more uniform, a normal threshold can be used. Studies indicate that a Gini coefficient of 0.3-0.4 corresponds to a reasonably equal distribution [27].

Applying the standardized Gini coefficient to hierarchical clustering, we can determine whether the numbers of samples contained in the clusters to be merged are sufficiently “equal”. If they are, the metric distance between clusters is calculated using the single-link standard of the original algorithm; if they are not, the effect of the fairness index is included in the single-link metric. When a cluster contains a large number of samples, the fairness index increases the measurement distance between it and the other clusters so that it does not dominate the clustering. The fairness index can thus largely avoid the tilt of the tree graph and ensure the rationality of the clustering results. A compact sketch of the complete procedure follows.
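Putting the steps together, a compact sketch of the whole loop, building on the hedged helpers above; here the end condition is a target number of clusters, and the fairness weighting is applied to both clusters of each candidate pair:

import numpy as np
from scipy.spatial.distance import cdist

def fair_hierarchical_clustering(X, n_clusters, threshold=0.4):
    """Steps (1)-(5) of the fair hierarchical clustering sketch."""
    clusters = [[i] for i in range(len(X))]   # step (1): one sample per cluster
    while len(clusters) > n_clusters:
        sizes = [len(c) for c in clusters]
        best = (np.inf, None, None)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cdist(X[clusters[i]], X[clusters[j]]).min()  # single-link
                d = fair_single_link(d, sizes, i, threshold)     # steps (2)-(3)
                d = fair_single_link(d, sizes, j, threshold)
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best                        # step (4): merge the closest pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters                           # step (5): output the result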

      3 Evaluation of clustering results

      3.1 Contour coefficient method

The essence of clustering is to separate the samples into distinct classes based on their characteristics, such that the similarity among samples within each class is as large as possible and the similarity among samples belonging to different classes is as small as possible [28]. Classification is different from clustering: in classification, the classification standard, that is, the feature vector of the statistical samples required for the task at hand, is already known, and often the class membership is either 0 or 1. In clustering, by contrast, we do not know in advance the sample characteristics on which the grouping is based; thus, there is no standard solution. The contour coefficient (known in the clustering literature as the silhouette coefficient) is therefore introduced as an evaluation index so that the clustering results can be analyzed quantitatively.

Suppose that after the clustering work is finished, the sample xi is assigned to set p, that is, xi ∈ Cp. The average distance A(xi) between xi and the other samples in set p is obtained as follows:

where cp is the number of samples contained in set p, dist(xi, xj) is the distance between sample xi in set p and sample xj, and the Euclidean distance of (4) can be used. Therefore, A(xi) describes the degree of association of xi with set p.

For the degree of association with the other sets, B(xi) expresses the minimum of the average distances from xi to the samples of each set other than set p, as follows:

where ck is the number of samples contained in set k, and B(xi) describes the minimum distance to the other sets, that is, the distance from xi to the nearest other set.

Therefore, the contour coefficient of the sample xi is obtained as follows:

S(xi) = (B(xi) − A(xi)) / max{ A(xi), B(xi) }

The contour coefficient S of the clustering result is determined by averaging over all the samples. It is evident from the definition of the sample contour coefficient that −1 ≤ S ≤ 1. When S is closer to 1, it is reasonable for the sample to be assigned to the set to which it belongs; when S is closer to −1, the sample should not be assigned to that set.
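The contour (silhouette) coefficient is implemented in scikit-learn; below is a sketch of the overall score and of picking a per-category clustering center as in Section 4. The data here are random placeholders:

import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(0)
X = rng.random((20, 5))
X /= X.sum(axis=1, keepdims=True)        # proportions summing to 1
labels = rng.integers(0, 3, size=20)     # placeholder clustering result

S = silhouette_score(X, labels)          # overall contour coefficient
s_i = silhouette_samples(X, labels)      # per-sample S(x_i) in [-1, 1]
# Clustering center of category 0: its sample with the largest S(x_i).
center = np.flatnonzero(labels == 0)[np.argmax(s_i[labels == 0])]
print(S, center)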

      3.2 t-SNE dimensionality reduction algorithm

The improved fair hierarchical clustering method can prevent the tilt of the hierarchical clustering tree to a certain extent and improve the rationality of the overall clustering, but there may be errors at individual sample points. To determine such errors more accurately, for example samples that belong to set A but are surrounded by set B, all the samples involved in the clustering are mapped onto a two-dimensional plane. This mapping also indicates the accuracy of the clustering results: if there are several such errors, the clustering accuracy is not high. We use this method primarily to compare the accuracy of the clustering results. In this section, we introduce the t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction algorithm.

t-SNE is a commonly used dimensionality reduction algorithm that was first used in image processing. In the process of dimensionality reduction, it can not only alleviate the data-crowding problem and improve visualization but also maximally retain the structure of the sample data, realizing the expression of high-dimensional sample data in a low-dimensional space.

The basic idea is to transform the distances between samples in the high-dimensional space into a probability distribution using a Gaussian distribution, and to transform the distances between the mapped points in the low-dimensional space into a probability distribution using a t-distribution [29]. The joint probabilities of the two distributions are used to express the similarity of the sample points, so that when the sample points of the high-dimensional space are mapped to the low-dimensional space, the probabilities are kept as consistent as possible to obtain the distribution of the samples in the low-dimensional space.

Let the set of sample points in the high-dimensional space be X = {x1, x2, …, xn} and the set of mapped points in the low-dimensional space be Y = {y1, y2, …, yn}. The dispersion (cost function) is the Kullback-Leibler divergence between the two distributions:

C = Σ_i Σ_j pij log(pij / qij)

where qij and pij are the joint probabilities of the distributions of the sample points in the low- and high-dimensional spaces, respectively. The conditional probability p(j|i) of the sample points xi and xj in the high-dimensional space X is expressed as follows, where σi is the Gaussian variance when xi is at the center:

p(j|i) = exp(−||xi − xj||² / (2σi²)) / Σ_{k≠i} exp(−||xi − xk||² / (2σi²))

The joint probability function in the high-dimensional space can then be obtained by symmetrization:

pij = (p(j|i) + p(i|j)) / (2n)

The joint probability function in the low-dimensional space is expressed as follows:

qij = (1 + ||yi − yj||²)⁻¹ / Σ_{k≠l} (1 + ||yk − yl||²)⁻¹

The gradient descent method is used to minimize the cost function, with gradient:

∂C/∂yi = 4 Σ_j (pij − qij)(yi − yj)(1 + ||yi − yj||²)⁻¹

The update rule of the gradient descent is as follows:

Y(T) = Y(T−1) + λ (∂C/∂Y) + φ(T) (Y(T−1) − Y(T−2))

where Y(T) is the matrix of coordinates of the high-dimensional sample points mapped to the low-dimensional space at iteration T, φ(T) is the momentum factor, and λ is a factor representing the learning rate.

The steps for projecting the substation samples from the high-dimensional space onto a two-dimensional plane using t-SNE are as follows:

(1) The number of substations and their feature vectors are input, and the joint probability pij of the distribution of the sample points in the high-dimensional space is calculated.

      (2) The mapping set Y in low-dimensional space is initialized randomly with a normal distribution, and the initial value Y(0) is obtained.

      (3) The joint probability function qij is calculated after the sample points are projected into low-dimensional space.

(4) The gradient is calculated, the low-dimensional coordinates Y(T) of this iteration are obtained, and the algorithm returns to step (3) until the iterations end.

(5) The image representation of the substations is obtained in the two-dimensional space Y = {y1, y2, …, yn}, and the image is drawn.
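A usage sketch with scikit-learn's implementation; the data and hyperparameter values are placeholders of ours, with perplexity and learning_rate playing the roles of σ and λ above:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((183, 5))                 # placeholder: 183 substations, 5 proportions
X /= X.sum(axis=1, keepdims=True)
labels = rng.integers(0, 8, size=183)    # placeholder clustering result

# Map the 5-dimensional feature vectors onto the 2-D plane.
Y = TSNE(n_components=2, perplexity=30, learning_rate=200.0,
         init='random', random_state=0).fit_transform(X)

plt.scatter(Y[:, 0], Y[:, 1], c=labels, cmap='tab10', s=15)
plt.title('t-SNE map of substation clusters')
plt.show()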

      4 Case study

      4.1 Fair hierarchical clustering analysis

In this section, we describe the use of the fair hierarchical clustering method to cluster the substations of an actual large power grid and compare the results with those obtained using the traditional hierarchical clustering method. To present the comparison clearly, the contour coefficients are calculated for both methods at the end of the clustering work, and the t-SNE algorithm is introduced to express the multidimensional data in two-dimensional space, which elucidates the advantages and disadvantages of the two clustering methods.

The data selected for this case study are the statistical results obtained from an investigation of the load composition of 183 substations (220 kV) belonging to a large regional power grid. The load on each substation is divided into five categories: industrial, agricultural, commercial, residential, and other loads. The threshold of the Gini coefficient is set to 0.5. The clustering results obtained using the fair hierarchical clustering algorithm are presented in Table 1.

The substation with the largest contour coefficient in each category is selected as the clustering center, and the classification is as follows:

Category 1: Primarily industry, commerce, and residents, with the three in roughly equal proportions; the clustering center is Substation No.73.

Category 2: Residents are dominant (accounting for more than 50%) and industry is secondary; the clustering center is Substation No.68.

Category 3: Primarily industry and residents (together accounting for more than 70%), supplemented by commerce and agriculture; the clustering center is Substation No.82.

Category 4: Industry is dominant (accounting for more than 70%), agriculture is secondary, and residents and commerce are essentially absent; the clustering center is Substation No.37.

Category 5: Primarily commercial (accounting for more than 50%), with industry and agriculture in roughly equal proportions; the clustering center is Substation No.67.

Category 6: Industry and commerce are dominant, in roughly equal proportions; the clustering center is Substation No.85.

Category 7: Agriculture is dominant (accounting for more than 50%) and industry is secondary; the clustering center is Substation No.172.

      Table 1 Results of the fair hierarchical clustering method

Category 1: 66, 70, 71, 73, 76, 77, 79, 83, 84
Category 2: 68, 69
Category 3: 74, 75, 81, 78, 80, 82, 176, 177, 93, 178, 179, 180, 181, 98, 88, 116, 129, 109, 110, 101, 97, 86, 96, 112, 126, 118, 130, 120, 124, 121, 122, 108, 102, 103, 90, 105, 106, 99, 94, 95, 89, 100, 91, 87
Category 4: 19, 37, 173, 174, 175
Category 5: 67
Category 6: 72, 85
Category 7: 171, 172
Category 8: 132, 182, 104, 117, 133, 111, 119, 115, 159, 123, 140, 143, 135, 145, 149, 146, 148, 128, 136, 137, 138, 158, 156, 150, 155, 134, 139, 142, 17, 6, 15, 3, 161, 167, 13, 10, 113, 166, 125, 168, 169, 114, 157, 160, 165, 107, 28, 152, 12, 144, 2, 92, 52, 65, 61, 60, 59, 58, 56, 57, 53, 54, 55, 153, 1, 18, 9, 151, 62, 8, 44, 41, 38, 63, 154, 162, 24, 23, 22, 20, 21, 39, 40, 36, 25, 14, 35, 34, 33, 32, 30, 31, 27, 164, 49, 64, 51, 48, 50, 26, 43, 29, 47, 183, 42, 45, 46, 4, 16, 5, 170, 7, 11, 127, 131, 141, 147, 163

Category 8: Industry is dominant (more than 60%), residents are secondary, and commerce and agriculture account for roughly equal shares; the clustering center is Substation No.134.

The clustering tree diagram for the fair hierarchical clustering algorithm is depicted in Fig.4, in which the red dot-dashed line represents the clustering line. After obtaining the clustering results, we can select a point on the vertical axis as a horizontal cut line based on the actual situation of the area; the intersections of this horizontal line with the branches identify the sample points contained in each class.
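For reference, such a tree and cut line can be drawn with SciPy's plotting helper; the data and cut height below are placeholder values of ours:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.random((20, 5))                        # placeholder substation features
Z = linkage(X, method='single')

dendrogram(Z)                                  # hierarchical clustering tree
plt.axhline(y=0.3, color='r', linestyle='-.')  # dot-dashed clustering line
plt.ylabel('inter-cluster distance')
plt.show()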

      4.2 Evaluation of clustering results

To illustrate the superiority of the fair hierarchical clustering method, the traditional hierarchical clustering method is applied to the same example, and the contour coefficients of the two clustering results are compared. The overall contour coefficient obtained using the fair hierarchical clustering method is 0.6472, whereas that of the traditional hierarchical clustering method is 0.5143, indicating that the former has better clustering accuracy.

      Fig.4 Fair hierarchical clustering tree

The clustering results obtained using the fair and traditional hierarchical clustering methods are reduced to two dimensions using the t-SNE algorithm, and the resulting t-SNE diagrams are displayed in Fig.5 and Fig.6.

Comparing Fig.5 and Fig.6, we can see that in the t-SNE diagram for the traditional hierarchical clustering method, individual points cross and overlap between the sets of two different colors, which indicates that such sampling may lead to clustering errors and needs further improvement; the fair hierarchical clustering method proposed in this paper does not encounter this problem and produces a better clustering effect.

      4.3 Comparison of actual disturbance curves

Based on the results of the cluster analysis, the parameters of each cluster-center substation are identified, and the measured voltage disturbance data are substituted into the load model of the regional power grid. The measured active and reactive power data are compared with the fitted data obtained from the traditional 334 model. The results of the comparison are presented in Fig.7.

It is evident that the load model established using fair hierarchical clustering is more accurate than the traditional 334 model. This suggests that fair hierarchical clustering has application value in substation clustering and load modeling.

      Fig.5 Two-dimensional t-SNE diagram representing the results obtained using fair hierarchical clustering

      Fig.6 Two-dimensional t-SNE diagram representing the results obtained using traditional hierarchical clustering

      5 Conclusion

      Fig.7 Comparison of actual disturbance curves between the fair hierarchical clustering and classical load models

In this paper, with the aim of clustering substations for the load modeling of a large-scale power grid, a fair hierarchical clustering method based on the Gini coefficient is proposed and applied to the clustering of the substations of an actual large power grid. The fair hierarchical clustering method improves the distance metric between sets of sample features by defining a fairness index, which can prevent the skew of the clustering tree. The calculated contour coefficients and t-SNE two-dimensional diagrams indicate that the fair hierarchical clustering method is more effective than the traditional hierarchical clustering method. Comparing the actual disturbance curves obtained from the load model built using the fair hierarchical clustering method and from the traditional 334 load model, we conclude that fair hierarchical clustering is also of significant value in practical application.

      Acknowledgements

      This work was supported by the Major Science and Technology Project of Yunnan Province entitled “Research and Application of Key Technologies of Power Grid Operation Analysis and Protection Control for Improving Green Power Consumption” (202002AF080001) and the China South Power Grid Science and Technology Project entitled “Research on Load Model and Modeling Method of Yunnan Power Grid” (YNKJXM20180017).

      Declaration of Competing Interest

      We declare that we have no conflict of interest.

      Author

      • Dajun Si

Dajun Si received his PhD from the Harbin Institute of Technology in 2005. He works at Yunnan Electric Power Grid Ltd. His research interests include power system stability analysis and electric power grid planning.

      • Wenyue Hu

Wenyue Hu received his master's degree from North China Electric Power University in 2020. His research interests include load modeling.

      • Zilin Deng

        Zilin Deng is working towards his PhD at North China Electric Power University.His research interests include load modeling.

      • Yanhui Xu

Yanhui Xu is the corresponding author. He received his PhD from North China Electric Power University in 2010, where he now works. His research interests include dynamic power system analysis and load modeling.

      Publish Info

Received: 2021-09-18

Accepted: 2021-11-30

Published: 2021-12-25

Reference: Dajun Si, Wenyue Hu, Zilin Deng, et al. (2021) Fair hierarchical clustering of substations based on Gini coefficient. Global Energy Interconnection, 4(6): 576-586.

      (Editor Yanbo Wang)