
Global Energy Interconnection
Volume 3, Issue 3, June 2020, Pages 227-236
Nash-Q learning-based collaborative dispatch strategy for interconnected power systems
Abstract
The large-scale utilization and sharing of renewable energy in interconnected systems is crucial for realizing "instrumented, interconnected, and intelligent" power grids. The traditional optimal dispatch method cannot comprehensively coordinate the economic benefits of all the stakeholders from multiple regions of the transmission network. Hence, this study proposes a large-scale wind-power coordinated consumption strategy based on the Nash-Q method and establishes an economic dispatch model for interconnected systems considering the uncertainty of wind power, with optimal wind-power consumption as the objective for redistributing the shared benefits between regions. Initially, based on the equivalent cost of the interests of stakeholders from different regions, the state decision models are constructed separately, and the non-cooperative game Nash equilibrium model is established. The Q-learning algorithm is then introduced for the high-dimension decision variables in the game model, and the dispatch solution methods for interconnected systems are presented, integrating the non-cooperative game Nash equilibrium and the Q-learning algorithm. Finally, the proposed method is verified through the modified IEEE 39-bus interconnection system, and it is established that this method achieves a reasonable distribution of interests between regions and promotes large-scale consumption of wind power.
1 Introduction
Wind energy has become one of the most promising new energy sources in the context of global energy interconnection. In China, as the distance between the wind-power resources and the load centers is considerable, large-scale wind-power grid integration leads to a shortage of backup resources in the regional power grid, making complete on-site consumption of the wind power difficult. Besides, the fluctuation and anti-peak characteristics of wind power lead to insufficient system peak-regulation capacity, resulting in serious wind-power curtailment [1]. Therefore, the maximization of wind-power consumption during grid dispatch has become a research hotspot [2-4].
With the development and expansion of the power system, "instrumentation, interconnection, and intelligence" is the development direction of the power grid, and each isolated power system will inevitably become interconnected through tie-lines [5]. Furthermore, the concept of cross-regional allocation of new energy between nations has been proposed [6-8]. With the traditional centralized algorithm, it is difficult to effectively solve the problems of hierarchical and partitioned dispatch in an interconnected power grid. Hence, robust optimization [9] has been adopted to describe the wind-power output, and a decentralized and coordinated scheduling model based on the objective cascade analysis method has been proposed for the decentralized and coordinated dispatch of an interconnected power system. Moreover, the synchronous alternating direction multiplier method has been applied to generate the optimization model for each region, and the large-scale consumption of wind-power dispatch involving energy storage has been considered [10]. A cutting-plane consistency algorithm has been proposed to decentralize the economic dispatch models in interconnected areas [11]; however, timely deletion of the non-functional cutting planes during the iteration process is necessary. To accelerate convergence, the Ward equivalence method has been applied for decomposing interconnected systems, and the objective function has been processed using the modified generalized Benders decomposition method [12]. In addition, a distributed algorithm based on Newton's method has been proposed to solve the problem of multi-regional economic scheduling [13], which can reduce the iteration time and effectively increase the calculation efficiency.
The aforementioned studies on interconnected system scheduling mainly focus on problems such as large-scale complex constraints and the difficulty of combining wind-power interconnected power grids with security constraints; hence, they mainly adopt mathematical methods for objective-function optimization. In most available studies, when multiple regions participate in dispatch, the regions are regarded as economic-interest communities with the same objective function in pursuit of the overall maximum interest, unreasonably distributing the benefits and even sacrificing the interests of certain regions. The coordinated dispatch of interconnected systems is a process in which two regions pursue their own interests to maximize not only their economic interests but also the overall social and economic interests. Therefore, game theory [14-16] can be used to analyze the respective interests of the interconnected regions and the Nash equilibrium that the two may achieve. Additionally, the aforementioned studies optimize the power output of the units through mathematical models based on the load and wind-power forecast data at each moment, in isolation, ignoring the fact that power system dispatch is a continuous and repeated process. For each process, the uncertainty prediction and wind-power dispatch can be based on experience. Therefore, reinforcement learning [17-19] provides a new solution for using such experience to optimize the power output of the units more efficiently, maximize wind-power consumption, and complete the game between regions based on system experience and conditions. With online learning ability, the wind-power consumption capacity of an integrated wind-storage system has been strengthened [20]. For coordination and dispatch in an integrated energy microgrid, a dispatch model has been constructed based on a multi-agent game and the Q-learning algorithm to maximize the operating incomes and reduce the costs of all the parties [21].
This study proposes a large-scale wind-power coordinated consumption strategy to realize the optimal operation of interconnected systems. The main contributions of this study are summarized as follows:
• A coordinated economic dispatch model based on the non-cooperative game Nash equilibrium, considering the interconnected wind-power uncertainty, is established. The interactions between the regions and the tie-line for various operational cases are presented in detail.
• The Q-learning algorithm is used to optimize the states of the two regions separately, to determine the Nash equilibrium comprising the maximum wind-power consumption, the most economical unit commitment, and the tie-line power output strategies.
• By analyzing the modified IEEE 39-bus interconnection system, the Nash-Q learning method proposed in this study is compared with the traditional dispatch method, establishing that the proposed method achieves a reasonable distribution of interests among regions and promotes large-scale consumption of wind power.
The remainder of this paper is organized as follows. Sections 2 and 3 introduce the detailed dispatch model and the Nash-Q learning method, respectively. The performed case study and discussions are presented in Section 4. Section 5 highlights the conclusions of this study.
2 Coordinated dispatch model based on non-cooperative game Nash equilibrium
2.1 Two-region economic optimization decision model
In interconnected system dispatch, for a certain area, the startup/shutdown of conventional thermal power units can be arranged based on the load data, wind-power forecast, and tie-line schedule to calculate the equivalent economic cost of each region, and subsequently that of the entire system. In the dispatch process of an interconnected system, one region does not necessarily cause another region to suffer the same economic loss while optimizing its own equivalent economic cost. The two interconnected parties may have strategies to achieve a common interest balance, i.e., to further emphasize the maximization of their own benefits from the perspective of cooperation; hence, the coordination of the economic interests of the two regions constitutes a non-cooperative game Nash equilibrium model. This study focuses on improving and emphasizing the autonomy of each region itself, as well as the collaboration and cooperation between regions. The startup/shutdown strategies are formulated separately in the two regions to realize coordination and cooperation between them through the adjustment of the tie-line power. The game factor is defined as the equivalent cost of each region, assuming that region I is a high wind-power generation area and region J is a wind-power consumption area; the following equations result:
where f_I and f_J are the equivalent cost functions of the region I and region J game players, respectively; T is the number of dispatch periods; N_IG and N_JG are the total numbers of thermal power units in each region, respectively; N_W is the number of wind farms in region I; the startup state of thermal power unit n in each period takes the value 0 or 1; F_In and F_Jn are the coal consumption functions of each region, respectively; the output and the starting cost of thermal power unit n are defined for each period of each region; C_W is the penalty cost coefficient of abandoned wind; ΔP_W,j,t is the abandoned wind power of wind turbine j in each period; γ is the tie-line electricity price; P_l,t is the power on the tie-line; S_n,hot and S_n,cold are the hot and cold startup costs, respectively; T_n,t,off is the continuous downtime of unit n in each period; T_n,min is the minimum stop time; T_n,cold is the cold start time; a_n, b_n, and c_n are the operating cost coefficients of unit n; P_W,j,t is the predicted wind power of wind turbine j in each period; and the grid-connected wind power of wind turbine j is defined for each period.
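The equivalent-cost expressions themselves did not survive extraction. A hedged reconstruction, consistent with the variable definitions above but with assumed symbol names (u_{n,t} for the on/off state, S_{n,t} for the period startup cost), would read:

```latex
% Assumed reconstruction, not the authors' exact typesetting: region I (wind
% sending end) bears fuel, startup and curtailment costs and earns tie-line
% revenue; region J (receiving end) bears fuel, startup and purchase costs.
f_I = \sum_{t=1}^{T}\Bigl[\sum_{n=1}^{N_{IG}}\bigl(u_{n,t}F_{In}(P_{n,t}) + S_{n,t}\bigr)
      + C_W\sum_{j=1}^{N_W}\Delta P_{Wj,t} - \gamma P_{l,t}\Bigr],
\qquad
f_J = \sum_{t=1}^{T}\Bigl[\sum_{n=1}^{N_{JG}}\bigl(u_{n,t}F_{Jn}(P_{n,t}) + S_{n,t}\bigr)
      + \gamma P_{l,t}\Bigr],
\quad\text{with } F_n(P) = a_nP^2 + b_nP + c_n,
\quad
S_{n,t} = \begin{cases}
  S_{n,\mathrm{hot}}, & T_{n,\min} \le T_{n,t,\mathrm{off}} \le T_{n,\min}+T_{n,\mathrm{cold}},\\
  S_{n,\mathrm{cold}}, & T_{n,t,\mathrm{off}} > T_{n,\min}+T_{n,\mathrm{cold}}.
\end{cases}
```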
2.2 Constraints
The unit output constraints are as follows:
where the two bounds are the minimum and maximum technical outputs of unit n, respectively.
The unit ramp rate constraints are as follows:
where R_D,n and R_U,n are the downward and upward ramp rates of unit n, respectively.
The minimum start-stop time constraints are as follows:
where T_n,on and T_n,off are the minimum on and off times of unit n, respectively.
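The constraint equations themselves were lost in extraction; the standard unit-commitment forms implied by the three where-clauses above are (with u_{n,t} an assumed on/off variable, not the authors' exact notation):

```latex
% Output limits, ramp limits, and minimum up/down times, in standard form.
u_{n,t}P_{n}^{\min} \le P_{n,t} \le u_{n,t}P_{n}^{\max}, \\
-R_{D,n} \le P_{n,t} - P_{n,t-1} \le R_{U,n}, \\
\bigl(T^{\mathrm{on}}_{n,t-1} - T_{n,\mathrm{on}}\bigr)\bigl(u_{n,t-1} - u_{n,t}\bigr) \ge 0, \qquad
\bigl(T^{\mathrm{off}}_{n,t-1} - T_{n,\mathrm{off}}\bigr)\bigl(u_{n,t} - u_{n,t-1}\bigr) \ge 0.
```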
The constructed wind-power output uncertainty set is expressed as follows [22]:
where the maximum and minimum predicted outputs of wind farm j in each period are given; the corresponding indicator variables range from 0 to 1; Γ_T is the maximum number of times that a specific wind farm j is forced to reach the predicted boundary value within the dispatch period T; and Γ_S is the maximum number of times that all the wind farms reach the predicted boundary value in each dispatch period t.

The power balance constraints are as follows:
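The set itself did not survive extraction; a budget-style uncertainty set consistent with the definitions above and with the construction in [22] (indicator names z^+, z^- are assumed here) is:

```latex
% Assumed budgeted uncertainty set for the wind-power output (illustrative
% notation): z^{+}, z^{-} in [0,1] push the output toward its bounds, with
% per-farm budget Gamma_T over time and per-period budget Gamma_S over farms.
\mathcal{W} = \Bigl\{ P_{Wj,t} \;\Bigm|\;
  P_{Wj,t} = \hat{P}_{Wj,t} + z^{+}_{j,t}\bigl(P^{\max}_{Wj,t}-\hat{P}_{Wj,t}\bigr)
           - z^{-}_{j,t}\bigl(\hat{P}_{Wj,t}-P^{\min}_{Wj,t}\bigr),\;
  z^{+}_{j,t}, z^{-}_{j,t} \in [0,1],\;
  \sum_{t=1}^{T}\bigl(z^{+}_{j,t}+z^{-}_{j,t}\bigr) \le \Gamma_T \;\forall j,\;
  \sum_{j=1}^{N_W}\bigl(z^{+}_{j,t}+z^{-}_{j,t}\bigr) \le \Gamma_S \;\forall t \Bigr\}
```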
where χ_t indicates the power flow direction between the interconnected regions, taking a positive sign if P_l,t flows in the specified positive direction, and D_t is the predicted value of the load in the respective area.
The load demand constraints are as follows:
where r is the load reserve ratio of the respective area.
The transmission capacity constraints are as follows:
where P_j,max is the maximum wind-power prediction value in a dispatch period; η_t is the proportional control coefficient in the adjustment period, 0 ≤ η_t ≤ 1; and P_l,max is the maximum limit of the transmission power on the tie-line.
The aforementioned interconnected system economic dispatch model, considering wind-power uncertainty, is a multivariate mixed-integer nonlinear optimization problem; due to computational limitations, it is difficult to determine the optimal solution for this model. In this study, the method presented in [23] is used to linearize the coal consumption and startup cost functions, as well as the nonlinear constraint variables in the constraint function; the problem is then transformed into a mixed-integer linear one.
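The linearization step can be illustrated with a small sketch. The snippet below is only an illustration of the basic idea behind a piecewise-linear MILP reformulation, not the exact formulation of [23]; the cost coefficients and breakpoints are made up:

```python
def piecewise_segments(a, b, c, p_min, p_max, n_seg):
    """Approximate the quadratic coal-consumption curve F(P) = a*P^2 + b*P + c
    by n_seg linear segments (slope, intercept) that join the curve at
    equally spaced breakpoints between p_min and p_max."""
    pts = [p_min + k * (p_max - p_min) / n_seg for k in range(n_seg + 1)]
    f = lambda p: a * p * p + b * p + c
    segs = []
    for p0, p1 in zip(pts, pts[1:]):
        slope = (f(p1) - f(p0)) / (p1 - p0)       # secant slope on [p0, p1]
        segs.append((slope, f(p0) - slope * p0))  # intercept so line hits f(p0)
    return segs

# Illustrative coefficients (hypothetical): two segments over 0-100 MW.
segments = piecewise_segments(0.25, 2.0, 10.0, 0.0, 100.0, 2)
```

In a MILP each segment becomes a linear cost term selected by the dispatch variables, which is what turns the quadratic objective into a linear one.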
2.3 Inter-area Nash equilibrium
Regions I and J constitute a non-cooperative game-based model in the process of pursuing their own best interests. The Nash equilibrium points achieved by each region independently are the optimal strategies of their own objective functions [24,25]. The equilibrium can be expressed as
where G is the equilibrium point of the game; g is the game function; I_g is the set of individuals involved in the game; f*_g is the Nash equilibrium winning function of the players participating in the game, namely, the equivalent cost; A*_g is the Nash equilibrium action strategy set of the game players; A*_I,t is the Nash equilibrium strategy for the startup output of the units in region I; A*_J,t is the Nash equilibrium strategy for the startup output of the units in region J; and P_l,t* is the Nash equilibrium strategy for the tie-line power values. Equations (15) and (16) represent a region's own optimal strategy when the other selects its optimal strategy.
The process of solving the Nash equilibrium problem is described as follows:
Step 1: Input the raw data, including the load prediction data, wind-power prediction data, and the various data and parameters required by the objective functions f_I and f_J.
Step 2:The initial value of the Nash equilibrium solution is given.
Step 3: Iterative search: the kth round optimization result is solved according to each region's own equivalent cost objective function and the optimization results of the previous round, i.e.,
Step 4: Determine whether the Nash equilibrium solution has been found, i.e., whether the kth round optimization result is consistent with the (k−1)th round:
If this condition is satisfied, the Nash equilibrium solution has been found, i.e., the Nash equilibrium equivalent costs f*_I and f*_J of each region, the tie-line power value at each moment, and the startup output strategies A*_I,t and A*_J,t can be determined.
3 Interconnected system economic dispatch based on the Nash-Q learning method
3.1 Basic principle of the Q-learning algorithm
Reinforcement learning is a process of repeatedly learning and interacting with the environment to strengthen certain decisions. In this process, as the agent acts on the environment through action A, the environment changes and a reward value R is generated. The agent receives the reward value R and selects the next action to be performed according to the reinforcement signal and the current state S, with the goal of finding the optimal strategy to accomplish the target task. A typical reinforcement learning model is depicted in Fig. 1.
Fig.1 Reinforcement learning model
Q-learning is a type of reinforcement learning, commonly based on the Markov decision process [26-28]. In this study, the Q-learning algorithm used in the Nash game results in more efficient commitment and dispatch decisions. It sets the previous empirical Q-value as the initial value of the subsequent iterative calculation, improving the convergence efficiency of the algorithm. The value function and iterative process of the Q-learning algorithm are expressed as follows:
where s and s′ are the current and next state, respectively; S is the state space set; β is the discount factor; R(s, a, s′) is the reward value obtained by performing action a from state s to state s′; the transition probability to state s′ after performing action a in state s is also defined; Q(s, a) is the Q-value of performing action a in state s; α is the learning factor; and Q_k(s_k, a_k) is the kth iteration value of the optimal value function Q*.
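The tabular update just described can be sketched in a few lines (a generic Q-learning step, not the authors' implementation; the state and action names are placeholders):

```python
def q_update(Q, s, a, reward, s_next, actions, alpha=0.01, beta=0.8):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (reward + beta * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (reward + beta * best_next - old)
    return Q[(s, a)]

# One step from an empty table: the target is reward + beta * 0 = 1.0,
# so Q(s0, a0) moves from 0 toward 1.0 by the learning factor alpha.
Q = {}
q_update(Q, "s0", "a0", 1.0, "s1", ["a0", "a1"], alpha=0.5)
```

The defaults α = 0.01 and β = 0.8 mirror the parameter values used in Section 4.1.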
3.2 Principle of the Nash-Q learning algorithm
In a non-cooperative game, due to the constant change of state, the action strategy of the game players changes accordingly. At time t, player i iteratively learns to update its value Q^i_t while equalizing against the value Q^j_t of player j; a Nash game equilibrium is then formed in state s. Defining the non-cooperative game Nash equilibrium solution accordingly, the updated iterative equation of the Nash-Q learning algorithm is given as follows [29]:
where the payoff of player i in state s for the selected equilibrium is indicated; the (k+1)th iteration value in the Q-value function for player i is defined; and R^i_k is the kth iteration value in the immediate reward function for player i.
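As an illustration, the two-player update can be sketched as: compute a stage-game Nash equilibrium over the Q-values in the successor state, then blend its payoff into player i's Q-value. This is a generic Nash-Q sketch in the spirit of [29], not the authors' code; for simplicity it searches only for a pure-strategy equilibrium and ignores mixed ones:

```python
import itertools

def pure_nash(payoff_I, payoff_J):
    """Return an action pair (i, j) that is a pure-strategy Nash equilibrium
    of the stage game given by two payoff matrices (higher = better), or None."""
    nI, nJ = len(payoff_I), len(payoff_I[0])
    for i, j in itertools.product(range(nI), range(nJ)):
        best_I = all(payoff_I[i][j] >= payoff_I[i2][j] for i2 in range(nI))
        best_J = all(payoff_J[i][j] >= payoff_J[i][j2] for j2 in range(nJ))
        if best_I and best_J:
            return i, j
    return None

def nash_q_update(Q_i, s, a, r_i, stage_I, stage_J, alpha=0.01, beta=0.8):
    """Nash-Q step: Q^i(s,a) <- (1-alpha)*Q^i(s,a) + alpha*(r_i + beta*NashQ^i(s')),
    where stage_I / stage_J hold the two players' Q-values in the next state s'."""
    eq = pure_nash(stage_I, stage_J)
    nash_value = stage_I[eq[0]][eq[1]] if eq else 0.0  # player i's equilibrium payoff
    Q_i[(s, a)] = (1 - alpha) * Q_i.get((s, a), 0.0) + alpha * (r_i + beta * nash_value)
    return Q_i[(s, a)]
```

Replacing the max over the agent's own actions (plain Q-learning) by the equilibrium payoff of the stage game is exactly what makes the update a Nash-Q update.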
3.3 Selection of the state space and action strategy set
In general, there are two methods for implementing the Q-function: the neural network method and the lookup table [26]. In this study, the latter is used to implement Q-learning, where the utility or worth of each corresponding action-state pair is solely expressed by a quantified value (Q-value), which is determined by an action-value function. Therefore, it is necessary to first determine S and A.
In the objective function model, the state variable set includes the load prediction value S_Load and the wind-power prediction value S_W in each period; the action variable set A includes the startup outputs a_I and a_J of the thermal power units in regions I and J, respectively, and the tie-line power value a_l. Before generating the Q-value table, we first discretize the continuous state and action variables to form (state, action) pair functions. The state variables can be discretized by the following expression [30]:
where ΔP_i is the interval length of the ith variable; P_i,max and P_i,min are the maximum and minimum values of each variable, respectively; and N_i is the interval number of the ith variable. All the state variables can be discretized into interval forms using (20), and the state to which each region belongs can be uniquely determined. The action strategy variables need to be discretized into fixed-value form, after which a set of action strategies a_k = {a_I, a_J, a_l} can be uniquely determined according to the unit states and the interval of the tie-line power value.
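Equation (20) amounts to uniform binning of each continuous variable. A minimal sketch (the 0-800 MW range below is an assumption for illustration, chosen to match 16 bins of 50 MW):

```python
def discretize(value, p_min, p_max, n_intervals):
    """Map a continuous state variable onto one of n_intervals equal-width
    bins of length dP = (p_max - p_min) / n_intervals, per eq. (20)."""
    dp = (p_max - p_min) / n_intervals
    idx = int((value - p_min) // dp)
    return min(max(idx, 0), n_intervals - 1)  # clamp values on the boundary

# E.g. with 16 load bins of 50 MW over an assumed 0-800 MW range,
# a 125 MW load falls into bin index 2 (the 100-150 MW interval).
bin_index = discretize(125.0, 0.0, 800.0, 16)
```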
3.4 Coordinated dispatch based on Nash-Q learning
After the state space S and action strategy set A are determined, pre-learning and online learning can be performed. Before reaching the optimal Q-value, pre-learning accumulation of experience is necessary to generate a Q-table that approximates the optimal solution; online learning can then be performed to obtain the best action strategy [31].
Fig. 2 Nash-Q learning flow chart
4 Analysis of examples
4.1 Description of the examples
In this study, the two-region economic dispatch model is solved through Yalmip programming in MATLAB, with the Gurobi solver applied in addition. The modified IEEE 39-bus two-region system depicted in Fig. A1 is used to verify the model. This interconnected system contains four wind turbines and 20 thermal power units. All four wind turbines are located in region I. Fig. 3 displays the wind-power prediction value and its upper and lower bounds. The parameters of the thermal power units and the load demand data in each region are available in [32], and the upper limit of the tie-line power transmission is 500 MW. The penalty cost of abandoned wind is 100 $/MW, the tie-line electricity price is γ = 20 $/MW, Γ_S = 4, and Γ_T = 12.
With respect to the parameter settings for the Q-learning algorithm, the learning factor is α = 0.01 and the discount factor is β = 0.8. For the state space division, the load power is divided into 16 discrete spaces at intervals of 50 MW, and the wind-power output is divided into six discrete spaces at intervals of 50 MW. Therefore, corresponding to a 24-h dispatch period, regions I and J correspond to 2304 and 384 states, respectively. For the action space division, the operation of the units is divided into two fixed states, start or stop, and the tie-line power value is divided into six fixed values {0, 100, 200, 300, 400, 500} MW. Therefore, regions I and J contain 6144 actions. Utilizing the annual historical load and wind-power data, the Nash-Q learning model is pre-learned, and a Q-table approximating the optimal solution is established, which gives Q-learning a higher decision-making ability. The regional costs range from $5.47 to 6.23 million and from $5.51 to 6.35 million in regions I and J, respectively, during the pre-learning stage.
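The state and action counts quoted above follow from simple products. The arithmetic below reproduces them under two assumptions not stated explicitly in the text: each region commits 10 of the 20 thermal units, and only region I observes wind states:

```python
# Reproduce the state/action counts of Section 4.1 (assumed structure:
# 10 units per region; wind states apply to region I only; all 24 hours).
load_states = 16   # load divided into 16 bins of 50 MW
wind_states = 6    # wind output divided into 6 bins of 50 MW
periods = 24       # 24-h dispatch period

states_I = load_states * wind_states * periods  # region I: load x wind x hour
states_J = load_states * periods                # region J: load x hour
actions = 2 ** 10 * 6  # 10 on/off units combined with 6 tie-line power levels
```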
Fig. 3 Prediction value, and upper and lower bounds, of the wind-power output
4.2 Impact of tie-line dispatch on the system based on Nash-Q learning
To better compare the analysis results,we designed the following three cases:
Case 1: The tie-line power is zero at all times, and the interconnected system is divided into two isolated systems in which economic dispatch is implemented, respectively.
Case 2: η_t = 0, which allows the tie-line power to be adjusted at any time. The traditional method is used to solve the interconnected dispatch model.
Case 3: η_t = 0, and the Nash-Q learning method proposed in this study, in which Q-learning has been pre-learned, is applied to solve the interconnected dispatch model.
The total costs of the three cases are shown in Table 1.
Table 1 Cost of each region for Cases 1-3

Case     fI/$     fJ/$     Total costs/$
Case 1   562337   585293   1147630
Case 2   551394   571358   1122752
Case 3   555391   569163   1124554
1) Analysis of Cases 3 and 1
Fig.4 Unit combinations in different cases
In Cases 3 and 1, units 1-4 in both regions are always in the startup group at any time, whereas the states of units 5-10 differ, as shown in Figs. 4(a) and 4(b). The total system dispatch cost in Case 3 is $1124554, which is $23076 less than that of Case 1. It can also be observed that the dispatch costs of both regions decrease in Case 3. Comparing Figs. 4(a) and 4(b), the number of startups and the running time in Case 3 are reduced in both regions, indicating that the interconnected system can not only reduce the operating cost but also promote the consumption of wind power.
2) Analysis of Cases 3 and 2
In Cases 3 and 2, the startup/shutdown of the units in the two regions differs for units 5-7, as depicted in Figs. 4(b) and 4(c). On comparing the startup/shutdown status and the equivalent cost of each region between Cases 2 and 3, it can be observed that in Case 3 the number of startups and the runtime of the units increase in region I, while they decrease in region J; correspondingly, the equivalent cost of region J decreases, whereas that of region I increases. This demonstrates that Nash-Q learning can redistribute the economic benefits of the two regions through iterative solutions to enable more wind-power consumption.
The transmission power of the tie-line in Cases 3 and 2 and the wind-power dispatch are displayed in Fig. 5. The total system dispatch cost in Case 2 is $1122752, and the total cost in Case 3 is 0.16% more than that of Case 2. From Fig. 5, it can be calculated that the wind-power consumption in Case 3 increases by 1.76% compared to Case 2. Moreover, with respect to Case 2, the tie-line power value decreases in the peak-load periods of 10-13 h and 20-21 h in Case 3, and it also reduces in the two low-load periods of 1-5 h and 23-24 h. In both cases, the wind power can be completely consumed in the 6-22 h period. In the two low-load periods of 1-5 h and 23-24 h, there is wind-power curtailment in both cases; however, the wind-power consumption capacity in Case 3 is higher. This demonstrates that Nash-Q learning can alleviate the peak pressure in the peak-load period by coordinating the interests of the two regions, ensuring the benefit of the wind-power receiving-end region J in the low-load period and promoting the consumption of wind power, but at the expense of the overall economics of the system.
Fig. 5 Comparison of the tie-line power and wind-power dispatch output between Cases 2 and 3
4.3 Influence of γ and η_t on the economics of interconnected systems
This study analyzes the influence of the adjustment-period control coefficient η_t and the tie-line electricity price γ on the economics. For η_t = 0 and γ ranging from 0 to 25 $/MW, the equivalent and total costs of each region are listed in Table 2. It can be seen that as the tie-line electricity price γ increases, the equivalent cost of region J gradually increases compared to that of region I; correspondingly, the overall economy and the wind-power consumption deteriorate. For γ = 20 $/MW and η_t ranging from 0 to 1, the changes in the total cost are presented in Table 3. It can be seen that with the increase in η_t, the time period in which the wind power can be shared between the regions decreases, deteriorating the wind-power consumption and the overall economy.
Table 2 Influence of γ on the economic dispatch of interconnected systems

γ/($/MW)   fI/$     fJ/$     Total costs/$
0          557844   565654   1123498
5          556789   567139   1123928
10         555713   568555   1124268
15         555539   568762   1124301
20         555391   569163   1124554
25         552083   573735   1125818
Table 3 Influence of η_t on the economic dispatch of interconnected systems

η_t             0         0.3       0.6       0.9       1
Total costs/$   1124554   1125280   1126583   1127824   1128030
4.4 Algorithm performance comparison
The discrete particle swarm optimization (DPSO) algorithm is used to solve the optimal dispatch strategy by optimizing the unit output to obtain the equivalent cost of each region. For the overall game process, the Nash equilibrium point is solved using the iterative search method; the solution procedures differ from those of the Q-learning method. The same time section is used for 10 repeated calculations with both algorithms. Four indicators are used for comparison: the mean value of the objective function, the variance (D), the standard deviation (SD), and the relative standard deviation (RSD) of each algorithm; the results are shown in Table 4. The stability of the solution method proposed in this study is slightly better than that of the DPSO, and the objective function values obtained by particle swarm optimization are more widely dispersed, increasing the uncertainty of the result.
Table 4 Optimization results of the two algorithms

Algorithm    Mean value of objective function/$   D        SD       RSD
DPSO         1138281                              0.2964   0.5444   0.0510
Q-learning   1124393                              0.0102   0.1427   0.0135
The iterative convergence process is illustrated in Fig. 6. The DPSO algorithm converges to the optimal value in 1200 iterations, whereas the Q-learning algorithm, after pre-learning, converges to the optimal cost value in approximately 600 iterations. Besides, the initial total cost of Q-learning is approximately 30% less than that of the DPSO, and it is also more economical after convergence. This shows that after pre-learning, the Q-learning algorithm has a reasonable ability to make an optimal decision based on the learned and accumulated experience; hence, its initial solution is close to the optimal value. Compared to the DPSO heuristic algorithm, the Q-learning algorithm has a more efficient solution speed and a better optimal decision solution for multivariate high-dimensional complex scheduling optimization problems.
Fig. 6 Comparison of the iterative convergence between the Q-learning and DPSO algorithms
5 Conclusions
In this study, a coordinated economic dispatch model considering the interconnected wind-power uncertainty was presented, which integrates and exploits the synergy between game theory and reinforcement learning algorithms. The rationality and feasibility of this model in dispatch decision-making were analyzed in detail and discussed. After verification through a calculation example, the following conclusions were drawn:
1) The economic dispatch of interconnected systems based on the Nash-Q learning algorithm can not only effectively deal with the uncertainty of the wind-power output and improve wind-power consumption but also redistribute the shared benefits between regions.
2) The tie-line electricity price γ has a significant effect on the dispatch cost of the interconnected system. The larger the value of γ, the higher the cost of purchasing wind power from the sending end and the lower the wind-power consumption. The larger the value of η_t, the less wind power is shared between the interconnected systems in each period, and the overall economics worsen.
3) The pre-learned Q-learning algorithm has better convergence and computational efficiency than the DPSO heuristic algorithm.
Further research can focus on improving the generalization of the proposed method, and online transfer learning can be used for application to various scenarios.
Acknowledgements
This work is supported by the Fundamental Research Funds For the Central Universities (No.2017MS093).
Appendix A
Fig. A1 Interconnection structure of the modified IEEE 39-bus system
References

[1] Shu Y, Zhang Z, Guo J, Zhang Z (2017) Study on key factors and solution of renewable energy accommodation. Proceedings of the CSEE 37(01):1-9
[2] Cui X, Zou C, Wang H, Zhou B (2018) Source and load coordinative optimal dispatching of combined heat and power system considering wind power accommodation. Electric Power Automation Equipment 38(07):74-81
[3] Zhang Q, Wang X, Yang T, Ren J, Zhang X (2017) A robust dispatch method for power grid with wind farms. Power Syst Technol 41(5):1451-1459
[4] Liu W, Wen J, Xie C, Wang W, Liu F (2014) Multi-objective fuzzy optimal dispatch based on source-load interaction for power system with wind farm. Electric Power Automation Equipment 34(10):56-63
[5] Fu Z, Li X, Yuan Y (2019) Research on technologies of ubiquitous power internet of things. Electric Power Construction 40(05):1-12
[6] Han J, Yi G, Xu P, Li J (2019) Study of future power interconnection scheme in ASEAN. Global Energy Interconnection 2(6):548-558
[7] Kåberger T (2018) Progress of renewable electricity replacing fossil fuels. Global Energy Interconnection 1(1):48-52
[8] Voropai N, Podkovalnikov S, Chudinova L, Letova K (2019) Development of electric power cooperation in Northeast Asia. Global Energy Interconnection 2(1):1-6
[9] Ren J, Xu Y (2018) Decentralized coordinated scheduling model of interconnected power systems considering wind power uncertainty. Automation of Electric Power Systems 42(16):41-47
[10] Ren J, Xu Y, Dong S (2018) A decentralized scheduling model with energy storage participation for interconnected power system with high wind power penetration. Power Syst Technol 42(4):1079-1086
[11] Zhao W, Liu M, Zhu J, Li L (2016) Fully decentralised multi-area dynamic economic dispatch for large-scale power systems via cutting plane consensus. IET Gener Transm Distrib 10(10):2486-2495
[12] Li Z, Wu W, Zhang B, Wang B (2015) Decentralized multi-area dynamic economic dispatch using modified generalized Benders decomposition. IEEE Trans Power Syst 31(1):526-538
[13] Lyu K, Tang H, Wang K, Tang B, Wu H (2019) Coordinated dispatching of source-grid-load for inter-regional power grid considering uncertainties of both source and load sides. Automation of Electric Power Systems 43(22):38-45
[14] Morris P (1994) Introduction to Game Theory. Springer-Verlag, New York
[15] Sutton R, Barto A (1998) Reinforcement Learning: An Introduction. MIT Press, Cambridge
[16] Lu Q, Chen L, Mei S (2014) Typical applications and prospects of game theory in power system. Proceedings of the CSEE 34(29):5009-5017
[17] Cheng L, Yu T, Zhang X, Yin L (2019) Machine learning for energy and electric power systems: state of the art and prospects. Automation of Electric Power Systems 43(1):15-43
[18] Yamada K, Takano S (2013) A reinforcement learning approach using reliability for multi-agent systems. The Society of Instrument and Control Engineers 49(1):39-47
[19] Huang Q, Uchibe E, Doya K (2016) Emergence of communication among reinforcement learning agents under coordination environment. 2016 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), Cergy-Pontoise, 19-22 Sept 2016
[20] Liu G, Han X, Wang S, Yang M, Wang M (2016) Optimal decision-making in the cooperation of wind power and energy storage based on reinforcement learning algorithm. Power Syst Technol 40(9):2729-2736
[21] Liu H, Li J, Ge S, Zhang P, Chen X (2019) Coordinated scheduling of grid-connected integrated energy microgrid based on multi-agent game and reinforcement learning. Automation of Electric Power Systems 43(01):40-50
[22] Wei W, Liu F, Mei S (2013) Robust and economical scheduling methodology for power systems: part two application examples. Automation of Electric Power Systems 37(18):60-67
[23] Carrion M, Arroyo JM (2006) A computationally efficient mixed-integer linear formulation for the thermal unit commitment problem. IEEE Trans Power Syst 21(3):1371-1378
[24] Wang Y, Wang X, Shao C, Gong N (2020) Distributed energy trading for an integrated energy system and electric vehicle charging stations: A Nash bargaining game approach. Renew Energy 155:513-530
[25] Chuang A, Wu F, Varaiya P (2001) A game-theoretic model for generation expansion planning: problem formulation and numerical comparisons. IEEE Trans Power Syst 16(4):885-891
[26] Wu X, Tang Z, Xu Q, Zhou Y (2020) Q-learning algorithm based method for enhancing resiliency of integrated energy system. Electric Power Automation Equipment 40(04):146-152
[27] Amato C, Chowdhary G, Geramifard A, Ure N, Kochenderfer M (2013) Decentralized control of partially observable Markov decision processes. 52nd IEEE Conference on Decision and Control, Florence, Italy, 10-13 Dec 2013
[28] Araabi B, Mastoureshgh S, Ahmadabadi M (2007) A study on expertise of agents and its effects on cooperative Q-learning. IEEE Trans Syst Man Cybern B 37:398-409
[29] Hu J, Wellman MP (2003) Nash Q-learning for general-sum stochastic games. J Mach Learn Res 4:1039-1069
[30] Wu X, Tang Z, Xu Q, Zhou Y (2020) Q-learning algorithm based method for enhancing resiliency of integrated energy system. Electric Power Automation Equipment 40(04):146-152
[31] Liu J, Ke Z, Zhou W (2020) Energy dispatch strategy and control optimization of microgrid based on reinforcement learning. Journal of Beijing University of Posts and Telecommunications 43(01):28-34
[32] Ongsakul W, Petcharaks N (2004) Unit commitment by enhanced adaptive Lagrangian relaxation. IEEE Trans Power Syst 19(1):620-628