Research Article | Peer-Reviewed

Incentive-Aware Learning in Stateful Environments: Internalizing Externalities in Principal-Agent MDPs and Welfare-Maximizing Diffusion Mechanisms

Received: 8 January 2026     Accepted: 6 February 2026     Published: 26 February 2026
Abstract

Modern AI systems increasingly operate in economic environments such as markets and insurance, where data, behavior, and incentives are endogenous and mutually reinforcing. This paper develops a microeconomic foundation for multi-agent learning by studying a repeated principal-agent interaction in a finite-horizon Markov decision process with strategic externalities, where both the principal and the agent learn over time and the agent’s actions affect payoffs and the environment’s dynamics. We design incentive schemes that align decentralized learning with social welfare via a two-phase mechanism: in Phase 1, the principal estimates the minimal transfers required to implement targeted actions by identifying how incentives shift the agent’s effective preferences; in Phase 2, the principal uses these estimates to steer long-run state-action visitation toward welfare-optimal behavior. We evaluate performance using social-welfare regret relative to the best achievable welfare benchmark, and we show that the mechanism achieves sublinear social-welfare regret under mild conditions (sublinear agent regret and sufficient exploration/coverage), implying asymptotically optimal welfare despite endogenous externalities and simultaneous learning. Simulations in a simple pollution-control environment illustrate that even coarse incentives can correct inefficient learning outcomes and substantially improve welfare. These results underscore that incentive-aware design, grounded in contract theory and mechanism design, is essential for safe, welfare-aligned AI deployed in strategic economic systems.
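The two-phase mechanism described in the abstract can be illustrated in a stripped-down, stateless toy instance. The sketch below is an illustrative assumption, not the paper's actual algorithm or environment: all payoff numbers are invented, the agent is a myopic best-responder rather than a learner, and the MDP dynamics are dropped. Phase 1 bisects on the transfer needed to make each action the agent's best response (estimating the minimal implementing transfer); Phase 2 pays that estimate, plus a small margin, for the welfare-optimal action.

```python
# Toy principal-agent instance (hypothetical numbers, not from the paper).
u_agent = [0.9, 0.2, 0.5]      # agent's private payoff per action
u_principal = [0.1, 1.0, 0.4]  # principal's payoff (e.g. avoided pollution damage)
# Social welfare internalizes the externality: sum of both parties' payoffs.
welfare = [a + p for a, p in zip(u_agent, u_principal)]

def best_response(transfers):
    """Myopic agent: maximizes private payoff plus offered transfer."""
    vals = [u + t for u, t in zip(u_agent, transfers)]
    return vals.index(max(vals))

def estimate_min_transfer(a, tol=1e-4, t_max=2.0):
    """Phase 1: bisect on the transfer that makes action `a` the best response."""
    lo, hi = 0.0, t_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        transfers = [0.0] * len(u_agent)
        transfers[a] = mid
        if best_response(transfers) == a:
            hi = mid  # transfer suffices; try smaller
        else:
            lo = mid  # transfer too small
    return hi

est = [estimate_min_transfer(a) for a in range(len(u_agent))]

# Phase 2: pay the estimated minimal transfer (plus a margin) for the
# welfare-optimal action, steering the agent toward the welfare benchmark.
a_star = welfare.index(max(welfare))
t2 = [0.0] * len(u_agent)
t2[a_star] = est[a_star] + 1e-3
chosen = best_response(t2)
```

In this instance the welfare-optimal action (index 1) is privately worst for the agent, so without transfers the agent never plays it; the estimated transfer of roughly 0.7 closes exactly the gap between the agent's favorite action and the target. In the paper's stateful setting, the analogous estimation must be done per state-action pair under exploration, and the regret guarantee depends on sufficient coverage.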

Published in American Journal of Artificial Intelligence (Volume 10, Issue 1)
DOI 10.11648/j.ajai.20261001.20
Page(s) 101-113
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2026. Published by Science Publishing Group

Keywords

Artificial Intelligence, Machine Learning, Multi-Agent Systems, Computational Game Theory

Cite This Article
  • APA Style

    Helou, N. (2026). Incentive-Aware Learning in Stateful Environments: Internalizing Externalities in Principal-Agent MDPs and Welfare-Maximizing Diffusion Mechanisms. American Journal of Artificial Intelligence, 10(1), 101-113. https://doi.org/10.11648/j.ajai.20261001.20

