Research Article | Peer-Reviewed

Use of Reinforcement Learning to Gain the Nash Equilibrium

Received: 29 August 2025     Accepted: 13 October 2025     Published: 31 October 2025
Abstract

Reinforcement learning (RL) is a type of machine learning in which an agent learns optimal behavior through interaction with its environment; it trains software to take desired actions by rewarding them. A strong Nash equilibrium (SNE) is a combination of the players' actions from which no coalition of players can profitably deviate together. In a Nash equilibrium, each player chooses the best strategy among all options given the strategies of the opponents, so no player can gain by changing strategy unilaterally. In non-cooperative two-player games, it occurs when each player knows the opponent's strategy and has no incentive to change their own. This paper explores the application of reinforcement learning algorithms within game theory, with a particular focus on their convergence toward Nash equilibrium. We analyze the q-learning approach in 2-agent environments, highlighting its capacity to learn optimal strategies through iterative interactions. Our theoretical investigation examines the conditions under which the algorithm converges to Nash equilibrium, considering factors such as learning rate schedules. The insights gained contribute to a deeper understanding of how reinforcement learning can serve as a powerful tool for equilibrium computation in complex strategic environments, paving the way for advanced applications in economics, automated negotiations, and autonomous systems.

Published in Mathematics Letters (Volume 11, Issue 3)
DOI 10.11648/j.ml.20251103.12
Page(s) 66-70
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2025. Published by Science Publishing Group

Keywords

Q-learning, Nash Equilibrium, Game Theory, Reinforcement Learning

1. Introduction
In game theory, learning in repeated games refers to the process through which players adapt their strategies over multiple iterations of a game based on past experiences and outcomes. In repeated games, players engage in the same game multiple times, which allows them to observe the actions of their opponents and modify their own strategies accordingly. Learning in repeated games involves strategy adjustment: players may change their strategies based on the results of previous rounds. This could involve being more cooperative if they perceive that cooperation leads to better overall outcomes, or becoming more competitive if they feel that cooperation is being exploited.
Related concepts include reputation effects, evolutionary learning, the folk theorem, and communication and coordination. Overall, learning in repeated games emphasizes the dynamic nature of strategic decision-making, where past interactions inform future choices and players continuously adapt to the behaviors of others. Learning algorithms are also important: players might use specific algorithms or heuristics, such as fictitious play, to guide their decisions. One of these algorithms is reinforcement learning, a type of machine learning where an agent learns to make decisions by interacting with an environment.
The goal of the agent is to maximize a cumulative reward over time through trial and error. It has been successfully applied in various fields, including robotics, recommendation systems, and autonomous vehicles, to name a few. It’s particularly powerful in situations where the optimal solution is difficult to define or where the agent must balance exploration (trying new things) and exploitation (using known strategies), see and references therein. Reinforcement learning contains some components like agent, environment, state, action, reward feedback, and value function. The core process of reinforcement learning involves the agent observing the current state of the environment, taking an action based on its policy, receiving a reward, and then updating its policy to improve future actions. This cycle continues, allowing the agent to learn and adapt over time, see .
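The observe, act, reward, update cycle described above can be sketched in a few lines. This is only a minimal illustration and not the procedure analyzed in this paper: it assumes a single state, deterministic rewards, and epsilon-greedy exploration, and the names `run_q_learning`, `lr`, and `epsilon` are hypothetical.

```python
import random


def run_q_learning(rewards, steps=2000, lr=0.1, epsilon=0.1, seed=0):
    """Single-state Q-learning with epsilon-greedy exploration (illustrative)."""
    rng = random.Random(seed)
    q = [0.0] * len(rewards)  # one value estimate per action
    for _ in range(steps):
        # Policy: explore with probability epsilon, otherwise exploit.
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))
        else:
            action = max(range(len(rewards)), key=lambda a: q[a])
        reward = rewards[action]  # feedback from the (deterministic) environment
        q[action] += lr * (reward - q[action])  # move the estimate toward the reward
    return q


q_values = run_q_learning([0.2, 1.0])
```

The agent starts with no knowledge of the rewards; the balance between exploration and exploitation is controlled here by the assumed parameter `epsilon`.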
Various types of reinforcement learning approaches are applied to repeated games. Most of them are presented in a multi-agent reinforcement learning setting; see and references therein. To review a few, authors studied the presence of multiple Nash equilibria in general-sum games, proposed a q-learning method, and studied its convergence to Nash equilibria. Authors introduced the individual q-learning algorithm and, using the stochastic approximation technique, studied its asymptotic behavior, showing that strategies converge to Nash equilibria almost surely in almost all 2-player games. Author applied reinforcement learning to poker, which is a multi-player extensive-form game with incomplete information. For a comprehensive review of this field, see and .
Here, we use the notation and results of for a two-player game in which payoffs (rewards) are presented in an $m \times m$ bi-matrix. First, suppose that $m = 2$ and consider the following matrix.
Table 1. Payoff Matrix.

|   | L | R |
|---|---|---|
| L | $(a_{11}, b_{11})$ | $(a_{12}, b_{12})$ |
| R | $(a_{21}, b_{21})$ | $(a_{22}, b_{22})$ |

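Let $\alpha_t$ ($\beta_t$) be the probability that the row (column) player, i.e., player 1 (player 2), chooses L at time $t$. Let
$$\pi_1(t) = (\alpha_t, 1-\alpha_t)^T,$$
$$\pi_2(t) = (\beta_t, 1-\beta_t)^T,$$
where $t = t_n \in [0, \infty)$, $n = 1, \dots, N$.
The payoff (reward) matrices of players 1 and 2 are $A$ and $B$, respectively,
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}.$$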
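As a small illustration of this notation, the expected payoff of a player under mixed strategies is the bilinear form of the payoff matrix with the two probability vectors. The sketch below assumes the matching pennies payoffs studied later in the paper and the hypothetical helper name `expected_payoff`; the probabilities 0.3 and 0.8 are taken from the simulation settings.

```python
def expected_payoff(M, p_row_L, p_col_L):
    """Expected payoff pi1^T M pi2 in a 2x2 game, where pi = (p, 1 - p)."""
    pi1 = (p_row_L, 1.0 - p_row_L)
    pi2 = (p_col_L, 1.0 - p_col_L)
    return sum(pi1[i] * M[i][j] * pi2[j] for i in range(2) for j in range(2))


A = [[1.0, -1.0], [-1.0, 1.0]]   # row player's payoffs (matching pennies)
B = [[-1.0, 1.0], [1.0, -1.0]]   # column player's payoffs, B = -A

u1 = expected_payoff(A, 0.3, 0.8)   # alpha = 0.3, beta = 0.8
u2 = expected_payoff(B, 0.3, 0.8)
```

Because the game is zero-sum, `u1` and `u2` cancel; for this particular $A$ the value of `u1` equals $(2\alpha - 1)(2\beta - 1)$.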
The rest of the paper is organized as follows. The q-learning procedure is given in the next section. The matching pennies game is studied in Section 3. Section 4 concludes.
2. Q-learning Procedure
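Following , let
$$q_i(t) = (q_{iL}(t), q_{iR}(t))^T, \quad i = 1, 2,$$
be the vector of q-value functions of the $i$-th player, $i = 1, 2$, at time $t$. Following their Lemma 1, it is seen that
$$\frac{dq_{1L}}{dt} = a_{11}\beta_t + a_{12}(1-\beta_t) - q_{1L},$$
$$\frac{dq_{1R}}{dt} = a_{21}\beta_t + a_{22}(1-\beta_t) - q_{1R}.$$
In matrix form, it is seen that
$$\frac{d}{dt}q_1(t) = A\pi_2(t) - q_1(t).$$
Similarly, it is seen that
$$\frac{d}{dt}q_2(t) = B\pi_1(t) - q_2(t).$$
It is easy to see that these results also hold for $m > 2$. The equation for $q_1$ has the solution
$$q_1(t) = A\int_0^t e^{-(t-s)}\pi_2(s)\,ds + q_1(0)e^{-t},$$
where, assuming $q_1(0) = 0$,
$$q_1(t) = A\int_0^t e^{-(t-s)}\pi_2(s)\,ds.$$
Similarly, assuming $q_2(0) = 0$,
$$q_2(t) = B\int_0^t e^{-(t-s)}\pi_1(s)\,ds.$$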
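The closed-form solution can be checked numerically in a simple special case. The sketch below assumes that the opponent's strategy is held constant, so the integral reduces to $A\pi_2(1 - e^{-t})$, and compares it with a forward Euler integration of the differential equation. The function names, the step size `dt`, and the matching pennies matrix used as input are illustrative choices.

```python
import math


def euler_q(A, beta, t_end=5.0, dt=1e-3):
    """Integrate dq/dt = A*pi2 - q from q(0) = 0 with forward Euler."""
    pi2 = (beta, 1.0 - beta)
    target = [A[i][0] * pi2[0] + A[i][1] * pi2[1] for i in range(2)]  # A * pi2
    q = [0.0, 0.0]
    for _ in range(int(round(t_end / dt))):
        q = [q[i] + dt * (target[i] - q[i]) for i in range(2)]
    return q, target


def closed_form_q(target, t):
    """q(t) = A*pi2*(1 - exp(-t)) when pi2 is constant and q(0) = 0."""
    return [x * (1.0 - math.exp(-t)) for x in target]


A = [[1.0, -1.0], [-1.0, 1.0]]
q_num, target = euler_q(A, beta=0.8)
q_exact = closed_form_q(target, 5.0)
```

With a constant opponent strategy, the q-values relax exponentially toward $A\pi_2$, which is the sense in which they track expected payoffs.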
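Following , for the updating procedure, let
$$\alpha_n = \alpha_{t_n}$$
and let $\tau$ be the temperature parameter. For $m = 2$, using the Boltzmann action selection and representing it in logit form, it is seen that
$$\mathrm{logit}(\alpha_{n+1}) = \log\frac{\alpha_{n+1}}{1-\alpha_{n+1}} = \frac{1}{\tau}\mathbf{1}^T q_1(t_n),$$
$$\mathrm{logit}(\beta_{n+1}) = \frac{1}{\tau}\mathbf{1}^T q_2(t_n),$$
where $\mathbf{1}^T = (1, -1)$. Then,
$$\mathrm{logit}(\alpha_{n+1}) = \frac{1}{\tau}\mathbf{1}^T A\int_0^{t_n} e^{-(t_n-s)}\pi_2(s)\,ds,$$
$$\mathrm{logit}(\beta_{n+1}) = \frac{1}{\tau}\mathbf{1}^T B\int_0^{t_n} e^{-(t_n-s)}\pi_1(s)\,ds.$$
The following proposition summarizes the above discussion:
Proposition 1. For a two-player game with payoff matrices $A$ and $B$ for players 1 and 2, respectively, the updating procedures for $\alpha_n$ and $\beta_n$ are given by
$$\mathrm{logit}(\alpha_{n+1}) = \frac{1}{\tau}\mathbf{1}^T A\int_0^{t_n} e^{-(t_n-s)}\pi_2(s)\,ds,$$
$$\mathrm{logit}(\beta_{n+1}) = \frac{1}{\tau}\mathbf{1}^T B\int_0^{t_n} e^{-(t_n-s)}\pi_1(s)\,ds.$$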
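The logit form of the Boltzmann rule can be verified directly for two actions: the softmax probability of L satisfies logit(α) = (q_L − q_R)/τ. The following sketch uses hypothetical q-values and the helper names `boltzmann_prob_L` and `logit`, which are not from the paper.

```python
import math


def boltzmann_prob_L(q_L, q_R, tau):
    """Probability of action L under Boltzmann (softmax) selection."""
    m = max(q_L, q_R)  # subtract the maximum for numerical stability
    e_L = math.exp((q_L - m) / tau)
    e_R = math.exp((q_R - m) / tau)
    return e_L / (e_L + e_R)


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


tau = 0.2
q_L, q_R = 0.6, -0.6               # illustrative q-values
alpha = boltzmann_prob_L(q_L, q_R, tau)
log_odds = logit(alpha)            # agrees with (q_L - q_R) / tau
```

A smaller temperature τ makes the selection greedier, since the same q-value gap then produces larger log-odds.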
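3. Matching Pennies
For the matching pennies game, we have
$$A = -B = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}, \qquad \mathbf{1}^T A = -\mathbf{1}^T B = 2\,\mathbf{1}^T,$$
where $\mathbf{1}^T = (1, -1)$. Then,
$$\mathrm{logit}(\alpha_{n+1}) = \frac{2}{\tau}\int_0^{t_n} e^{-(t_n-s)}\,\mathbf{1}^T\pi_2(s)\,ds = \frac{2}{\tau}\int_0^{t_n} e^{-(t_n-s)}(2\beta_s - 1)\,ds,$$
$$\mathrm{logit}(\beta_{n+1}) = \frac{2}{\tau}\int_0^{t_n} e^{-(t_n-s)}(1 - 2\alpha_s)\,ds.$$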
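The two algebraic facts used here, that the vector (1, −1) times A equals 2(1, −1) and that (1, −1) applied to π₂ gives 2β − 1, are easy to confirm numerically. The helper name `row_times_matrix` and the test value β = 0.8 are illustrative.

```python
def row_times_matrix(v, M):
    """Row vector times a 2x2 matrix: returns v^T M as a list."""
    return [sum(v[i] * M[i][j] for i in range(2)) for j in range(2)]


one = [1.0, -1.0]                   # the vector written as 1^T in the text
A = [[1.0, -1.0], [-1.0, 1.0]]      # matching pennies, player 1
B = [[-1.0, 1.0], [1.0, -1.0]]      # player 2, B = -A

vA = row_times_matrix(one, A)
vB = row_times_matrix(one, B)

beta = 0.8
pi2 = [beta, 1.0 - beta]
drive = sum(vA[j] * pi2[j] for j in range(2))   # equals 2 * (2*beta - 1)
```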
3.1. Nash Convergence
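Hereafter, the convergence to Nash equilibrium is studied. To this end, let
$$\theta_n = \mathrm{logit}(\alpha_n).$$
It is easy to see that
$$\left|e^{t_n}\theta_{n+1} - e^{t_{n-1}}\theta_n\right| = \frac{2}{\tau}\left|\int_{t_{n-1}}^{t_n} e^{s}(2\beta_s - 1)\,ds\right| \le \frac{2}{\tau}\int_{t_{n-1}}^{t_n} e^{s}|2\beta_s - 1|\,ds \le \frac{2}{\tau}\int_{t_{n-1}}^{t_n} e^{s}\,ds = \frac{2}{\tau}\left(e^{t_n} - e^{t_{n-1}}\right).$$
Let $\delta_n = t_n - t_{n-1} > 0$. Thus,
$$\left|\theta_{n+1} - e^{-\delta_n}\theta_n\right| \le \frac{2}{\tau}\left(1 - e^{-\delta_n}\right).$$
Therefore,
$$\left|\theta_{n+1} - \theta_n + (1 - e^{-\delta_n})\theta_n\right| \le \frac{2}{\tau}\left(1 - e^{-\delta_n}\right).$$
The integral representation of $\mathrm{logit}(\alpha_{n+1})$ also gives $|\theta_n| \le 2/\tau$, so the term $(1 - e^{-\delta_n})\theta_n$ vanishes with $\delta_n$. As $\delta_n$ tends to zero and $1 - e^{-\delta_n} \ll \tau/2$, then
$$\theta_{n+1} - \theta_n \to 0.$$
Since the inverse of the logit function is Lipschitz continuous,
$$\alpha_{n+1} - \alpha_n \to 0.$$
The following proposition summarizes the above discussion; the proof is omitted.
Proposition 2. Assuming $1 - e^{-\delta_n} \ll \tau/2$, as $\delta_n$ tends to zero,
$$\alpha_{n+1} - \alpha_n \to 0.$$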
3.2. Recursive Relation
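Here, under a simplifying assumption, a recursive relation is derived that reduces the computational complexity of the update. Assuming
$$\beta_s = \beta_{j-1} \quad \text{for } s \in [t_{j-1}, t_j),$$
then
$$e^{t_n}\theta_{n+1} = \frac{2}{\tau}\sum_{j=1}^{n}\int_{t_{j-1}}^{t_j} e^{s}(2\beta_{j-1} - 1)\,ds = \frac{2}{\tau}\sum_{j=1}^{n}(2\beta_{j-1} - 1)\left(e^{t_j} - e^{t_{j-1}}\right).$$
Assuming $t_j - t_{j-1} = \delta_n$ for $j \le n$, then
$$\theta_{n+1} = e^{-\delta_n}\theta_n + \frac{4(1 - e^{-\delta_n})}{\tau}(\beta_{n-1} - 0.5).$$
The following proposition generalizes the above discussion:
Proposition 3. For a general $2 \times 2$ payoff bi-matrix $A$,
$$\theta_{n+1} = e^{-\delta_n}\theta_n + \frac{1 - e^{-\delta_n}}{\tau}(l_1\beta_{n-1} + l_2),$$
where
$$l_1 = a_{11} + a_{22} - (a_{12} + a_{21}),$$
$$l_2 = a_{12} - a_{22}.$$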
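One step of the recursion in Proposition 3 can be written as a short function. The sketch below is an illustration under the stated piecewise-constant assumption; the function names `drive_coefficients` and `theta_step` are hypothetical, and the matching pennies matrix is used as the example input.

```python
import math


def drive_coefficients(A):
    """Return (l1, l2) such that (1, -1) A (beta, 1 - beta)^T = l1*beta + l2."""
    l1 = A[0][0] + A[1][1] - (A[0][1] + A[1][0])
    l2 = A[0][1] - A[1][1]
    return l1, l2


def theta_step(theta, beta_prev, A, delta, tau):
    """One update of theta = logit(alpha) following Proposition 3."""
    l1, l2 = drive_coefficients(A)
    decay = math.exp(-delta)
    return decay * theta + (1.0 - decay) / tau * (l1 * beta_prev + l2)


matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]
l1, l2 = drive_coefficients(matching_pennies)   # l1*beta + l2 = 4*(beta - 0.5)
theta_next = theta_step(0.0, 0.8, matching_pennies, delta=0.1, tau=0.2)
```

The update is a convex combination of the current value $\theta_n$ and the target $(l_1\beta_{n-1} + l_2)/\tau$, with weight $e^{-\delta_n}$ on the current value.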
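Remark 1. Notice that
$$|\theta_{n+1} - \theta_n| = (1 - e^{-\delta_n})\left|\frac{l_1\beta_{n-1} + l_2}{\tau} - \theta_n\right|,$$
where, assuming $1 - e^{-\delta_n} \ll \tau/2$, $|\theta_{n+1} - \theta_n| \to 0$ is again derived. Let $\zeta_n = \mathrm{logit}(\beta_n)$. Also, notice that
$$2\beta_{n-1} - 1 = \frac{e^{\zeta_{n-1}} - 1}{e^{\zeta_{n-1}} + 1}.$$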
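The last identity expresses the driving term in terms of the opponent's logit and can be checked numerically; the right-hand side is also the hyperbolic tangent of half the logit. The value β = 0.8 and the function name `two_beta_minus_one` are illustrative.

```python
import math


def two_beta_minus_one(zeta):
    """Right-hand side of the identity: (e^zeta - 1) / (e^zeta + 1)."""
    return (math.exp(zeta) - 1.0) / (math.exp(zeta) + 1.0)


beta = 0.8
zeta = math.log(beta / (1.0 - beta))   # zeta = logit(beta)
lhs = 2.0 * beta - 1.0
rhs = two_beta_minus_one(zeta)
```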
3.3. Simulations
For the matching pennies game, let $\delta_n = 0.1$, $\tau = 0.2$, $\alpha_0 = 0.3$, and $\beta_0 = 0.8$. In Figure 1, the blue and red lines show $\alpha_n$ and $\beta_n$, respectively.
Figure 1. Convergence of $\alpha_n$ and $\beta_n$.
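A simulation of this kind could be set up as sketched below. This is not the author's code, and it makes several assumptions that the text does not state: the recursion for ζ = logit(β) is obtained by symmetry from the integral form of logit(β), the initial logits are taken from α₀ and β₀, and the lagged values at the first step are set equal to the initial values. The function name `simulate_matching_pennies` and the number of steps are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_matching_pennies(alpha0=0.3, beta0=0.8, delta=0.1, tau=0.2, steps=200):
    """Iterate the logit recursions for both players in matching pennies."""
    decay = math.exp(-delta)
    gain = 4.0 * (1.0 - decay) / tau
    theta, zeta = logit(alpha0), logit(beta0)
    alphas, betas = [alpha0], [beta0]
    for n in range(steps):
        # Lagged strategies (index n-1); assumed equal to the initial values at n = 0.
        a_lag = alphas[n - 1] if n > 0 else alpha0
        b_lag = betas[n - 1] if n > 0 else beta0
        theta = decay * theta + gain * (b_lag - 0.5)   # player 1 update
        zeta = decay * zeta + gain * (0.5 - a_lag)     # player 2 update (assumed by symmetry)
        alphas.append(expit(theta))
        betas.append(expit(zeta))
    return alphas, betas


alphas, betas = simulate_matching_pennies()
```

The returned lists can be plotted against the step index to produce a figure of the same type as Figure 1; the trajectories depend on the assumed initialization.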
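Remark 2. For example, for the prisoner's dilemma game with
$$A = \begin{pmatrix} -2 & -10 \\ -1 & -5 \end{pmatrix},$$
we have $l_1 = 4$, $l_2 = -5$, and
$$\theta_{n+1} = e^{-\delta_n}\theta_n + \frac{1 - e^{-\delta_n}}{\tau}(4\beta_{n-1} - 5).$$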
3.4. Comparisons
Here, the rates of convergence of the reinforcement learning and gradient descent methods are compared for the matching pennies game. The gradient descent sequential updating procedure is given by
$$\alpha_{n+1} - \alpha_n = 4(\beta_n - 0.5)\zeta,$$
where $\zeta$ is the step parameter, see . For reinforcement learning, it is seen that
$$|\theta_{n+1} - \theta_n| = (1 - e^{-\delta_n})\left|\frac{4}{\tau}(\beta_n - 0.5) - \theta_n\right| \approx 4|\beta_n - 0.5|\,\frac{1 - e^{-\delta_n}}{\tau}.$$
Since the logit function is one-to-one, it is enough to compare $\zeta$ and $(1 - e^{-\delta_n})/\tau$. Whenever
$$\tau\zeta > 1 - e^{-\delta_n},$$
the gradient descent method has the faster rate of convergence.
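The comparison reduces to two scalar step factors, which can be computed for given parameters. The sketch below uses δ = 0.1 and τ = 0.2 from the simulation settings; the gradient step size 0.5, the function names, and the projection onto [0, 1] are assumptions added for illustration.

```python
import math


def rl_step_factor(delta, tau):
    """Step factor (1 - exp(-delta)) / tau of the logit recursion."""
    return (1.0 - math.exp(-delta)) / tau


def gradient_step(alpha, beta, step):
    """One gradient update for the row player in matching pennies."""
    new_alpha = alpha + step * 4.0 * (beta - 0.5)
    return min(1.0, max(0.0, new_alpha))   # projection onto [0, 1] is an added safeguard


factor = rl_step_factor(delta=0.1, tau=0.2)
step = 0.5                                  # hypothetical gradient step size
gradient_is_faster = 0.2 * step > 1.0 - math.exp(-0.1)   # tau*zeta > 1 - e^{-delta}
alpha_next = gradient_step(0.3, 0.8, step=0.1)
```

The Boolean `gradient_is_faster` is equivalent to checking `step > factor`, which is the comparison of ζ with $(1 - e^{-\delta_n})/\tau$ made in the text.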
4. Conclusions
In conclusion, the exploration of reinforcement learning (RL) within the framework of game theory has yielded promising advancements in both fields. RL algorithms, with their ability to learn optimal strategies through trial and error, provide a powerful tool for addressing the complexities inherent in multi-agent environments. This paper has examined q-learning with Boltzmann action selection in two-player bi-matrix games, derived a recursive update for the logit of the mixed strategy, and compared its rate of convergence with that of gradient descent in the matching pennies game.
However, the integration of RL and game theory is not without its challenges. The non-stationarity of multi-agent environments, the curse of dimensionality in large games, and the difficulty of ensuring convergence to desirable equilibrium concepts remain significant hurdles. Future research should focus on developing more robust and scalable RL algorithms specifically tailored for game-theoretic settings. This includes exploring techniques such as multi-agent learning with communication, hierarchical RL for complex strategy spaces, and the development of novel exploration-exploitation strategies designed to navigate the challenges of interacting with dynamically changing opponents.
Ultimately, the synergy between reinforcement learning and game theory holds immense potential for understanding and influencing strategic interactions in a wide range of real-world domains, including economics, robotics, and social science. By continuing to refine and expand the theoretical foundations and practical applications of this interdisciplinary field, we can unlock new possibilities for creating intelligent, cooperative, and adaptive systems capable of thriving in complex multi-agent environments. This ongoing research will not only advance our understanding of strategic decision-making but also pave the way for the development of more intelligent and autonomous agents capable of navigating the complexities of the world around us.
Abbreviations

Logit: Logit Function

Author Contributions
Reza Habibi is the sole author. The author read and approved the final manuscript.
Conflicts of Interest
The author declares no conflicts of interest.
References
[1] Ballard, D. and Zhu, S. (2022). Overcoming non-stationary in un-communication learning. In: International Conference on Machine Learning 2002, 354-363.
[2] Brown, N. and Sandholm, T. (2017). Dynamic threshold and pruning for regret minimization. In: International Conference in Machine Learning 2017, 793-802.
[3] Collin-Dufresne, P. and Fos, V. (2012). Insider trading, stochastic liquidity and equilibrium prices. Technical report. Columbia University and University of Illinois.
[4] Cristofol, M. and Roques, L. (2016). Simultaneous determination of the drift and diffusion coefficients in stochastic differential equations. Technical report. Institut de Mathématiques de Marseille, France.
[5] Gyungmin, P. (2022). Insider trading, stock volatility and market liquidity in the Korean capital market. Studies in Business and Economics 17, 175-189.
[6] Harris, L. (1998). Optimal dynamic trading in the presence of insider information. Journal of Financial Markets 1, 123-148.
[7] Kyle, A. S. (1985). Continuous auctions and insider trading. Econometrica 53, 1315-1335.
[8] Leslie, D. S., and Collins, E. J. (2005). Individual q-learning in normal form games. SIAM Journal on Control and Optimization 44, 1-20.
[9] Mailath, G. & Samuelson, L. (2016). Repeated games and reputations: long-run relationships. Oxford University Press. USA.
[10] Morris, S. and Shin, H. S. (1998). Unique equilibrium in a model of self-fulfilling currency attacks. American Economic Review 88, 587–597.
[11] Nguyen, T. T., Nguyen, N. D., Nahavandi, S. (2020). Deep reinforcement learning for multi-agent systems: a review of challenges, solutions and applications. IEEE Transactions on Cybernetics 20, 3826-3839.
[12] Osborne, M. and Rubinstein, A. (1994). A course in game theory. MIT Press. Cambridge.
[13] Rahman, M., & Mollah, M. (2019). Mathematical modeling of insider trading: a game theoretical approach. Journal of Risk and Financial Management 12, 138-158.
[14] Singh, S., Kearns, M., and Mansour, Y. (2013). Nash convergence of gradient dynamics in general-sum games. Technical Report. AT&T Labs and Tel Aviv University.
[15] Sutton, R. S., and Barto, A. G. (2009). Reinforcement learning: an introduction. MIT Press. Cambridge, MA.
Cite This Article
  • APA Style

    Habibi, R. (2025). Use of Reinforcement Learning to Gain the Nash Equilibrium. Mathematics Letters, 11(3), 66-70. https://doi.org/10.11648/j.ml.20251103.12

    ACS Style

    Habibi, R. Use of Reinforcement Learning to Gain the Nash Equilibrium. Math. Lett. 2025, 11(3), 66-70. doi: 10.11648/j.ml.20251103.12

    AMA Style

    Habibi R. Use of Reinforcement Learning to Gain the Nash Equilibrium. Math Lett. 2025;11(3):66-70. doi: 10.11648/j.ml.20251103.12

  • @article{10.11648/j.ml.20251103.12,
      author = {Reza Habibi},
      title = {Use of Reinforcement Learning to Gain the Nash Equilibrium},
      journal = {Mathematics Letters},
      volume = {11},
      number = {3},
      pages = {66-70},
      doi = {10.11648/j.ml.20251103.12},
      url = {https://doi.org/10.11648/j.ml.20251103.12},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ml.20251103.12},
     year = {2025}
    }
    

  • TY  - JOUR
    T1  - Use of Reinforcement Learning to Gain the Nash Equilibrium
    
    AU  - Reza Habibi
    Y1  - 2025/10/31
    PY  - 2025
    N1  - https://doi.org/10.11648/j.ml.20251103.12
    DO  - 10.11648/j.ml.20251103.12
    T2  - Mathematics Letters
    JF  - Mathematics Letters
    JO  - Mathematics Letters
    SP  - 66
    EP  - 70
    PB  - Science Publishing Group
    SN  - 2575-5056
    UR  - https://doi.org/10.11648/j.ml.20251103.12
    VL  - 11
    IS  - 3
    ER  - 

Author Information