The Post-Deployment Monitoring of Artificial Intelligence: Emerging Challenges in Oversight, Evaluation, and Accountability

26 Apr 2026

For most of the past decade, AI governance debates have focused on what happens before a system reaches the public: how it is designed, what data it is trained on, whether it passes pre-deployment testing, and whether it clears the relevant regulatory or ethical review. These concerns are valid, but they rest on the assumption that a system that performs acceptably in controlled conditions will continue to behave that way once it is operating in the world.

That assumption has not aged well. AI systems built on large language models (LLMs) or adaptive machine learning architectures do not behave as stable objects. They interact with users whose needs and behaviors shift over time. They are integrated into the workflows of institutions that were not designed with them in mind. They receive updates, fine-tuning, and configuration changes that may alter their behavior in ways that are not immediately visible to the organizations deploying them. And they operate across sectors such as healthcare, finance, criminal justice, and public administration, where degraded or biased performance can cause real harm to people.

The 2025–2026 period has brought this problem into sharper focus. The EU AI Act dedicates an entire chapter to post-market monitoring, serious incident reporting, and market surveillance, a legal recognition that governance cannot end at the point of deployment.[1] Similarly, the NIST AI Risk Management Framework and its 2024 Generative AI Profile treat post-deployment monitoring as a core element of responsible AI management.[2] The OECD, through its AI Incidents Monitor and its February 2025 paper on common reporting frameworks for AI incidents, has been building the conceptual and institutional infrastructure for tracking what goes wrong with AI systems in practice.[3]

What these frameworks collectively signal is that the governance community has begun to accept that the risks associated with AI systems often become visible only after deployment. Pre-deployment testing can identify known failure modes, but it cannot anticipate every context in which a system will be used, every population it will affect, or every way in which its environment will change. Post-deployment monitoring is where governance ambitions will either be achieved or quietly dissolve.

This insight examines three dimensions of that challenge: oversight, the mechanisms through which AI systems are supervised once they are in operation; evaluation, the methods and standards by which deployed systems are assessed for continued safety, fairness, and reliability; and accountability, the legal, organizational, and procedural frameworks that assign responsibility when things go wrong. Together, these dimensions define what it means to govern AI intentionally after deployment.

Emerging challenges in oversight

Oversight, in the AI governance context, refers to the ongoing supervision of AI systems by regulatory bodies, deploying organizations, and other responsible parties to ensure that those systems continue to operate within legal and ethical limits. In practice, oversight has proved difficult to translate into operational terms once a system has been deployed and integrated into real-world workflows.

The most significant difficulty is the shift from one-time compliance to continuous supervision. Most existing regulatory frameworks were designed around a checkpoint model, in which a system is assessed, approved, and then presumed to be compliant unless something goes visibly wrong. That model works well for static technologies, but poorly for AI systems that can change their behavior through updates or interaction with new data environments. The EU AI Act attempts to address this by requiring providers of high-risk AI systems to establish post-market monitoring plans that actively collect and review data on system performance throughout the system’s lifetime.[4] In practice, however, many organizations lack the internal capacity or institutional culture to treat compliance as a continuous process.

A model that was validated at the point of release may behave differently after a software update, a change in the underlying data pipeline, or integration into a new organizational workflow. Under existing frameworks, these changes may not trigger a formal re-assessment, even though they alter the system’s risk profile.

The EU AI Act does require providers to document and assess substantial modifications to high-risk AI systems, but what constitutes a “substantial modification” is not clearly defined and is largely left to the provider to determine.[5] There is therefore an obvious gap between the intention behind the regulation and its operational reality.
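To make concrete what closing that gap might involve at a minimum, the sketch below shows one way a deployer could at least detect that a deployed system has changed since its last assessment, by fingerprinting the model artifact together with its deployment configuration and flagging any difference for human review. The file names, the single “approved fingerprint” record, and the overall workflow are illustrative assumptions, not anything the Act prescribes.

```python
# Minimal sketch (illustrative, not a regulatory requirement): detect that a
# deployed model or its configuration has changed since the last assessment.
import hashlib
import json
from pathlib import Path

STATE_FILE = Path("approved_fingerprint.json")  # hypothetical record of the last assessed state


def fingerprint(model_path: str, config: dict) -> str:
    """Hash the model artifact together with its deployment configuration."""
    h = hashlib.sha256()
    h.update(Path(model_path).read_bytes())
    h.update(json.dumps(config, sort_keys=True).encode())
    return h.hexdigest()


def modification_detected(model_path: str, config: dict) -> bool:
    """Return True if the deployed system differs from the last assessed version."""
    current = fingerprint(model_path, config)
    if not STATE_FILE.exists():
        # First run: record the current state as the assessed baseline.
        STATE_FILE.write_text(json.dumps({"fingerprint": current}))
        return False
    approved = json.loads(STATE_FILE.read_text())["fingerprint"]
    return current != approved
```

A check like this does not decide whether a change is “substantial”; it only ensures that changes are noticed and queued for the human judgment the regulation assumes will happen.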

A 2025 AI governance survey found that fewer than half of organizations actively monitor their production AI systems for accuracy, drift, and misuse, and that this figure drops to just nine percent among small companies.[6] If deploying organizations themselves are not monitoring their systems, the prospects for effective external oversight are limited.

The division of responsibility across the AI supply chain complicates oversight further. In most cases, the entity that developed the AI system is not the entity that deploys it, nor the entity that integrates it into a workflow, nor the end users who interact with it daily. The EU AI Act distinguishes between providers and deployers and assigns different obligations to each, but the boundaries between these roles are not always clear. A hospital that deploys a clinical decision-support tool built on a foundation model from a third-party vendor, for example, occupies the positions of deployer, integrator, and, in some respects, secondary provider of AI-assisted services. Responsibility in such cases tends to diffuse, with each actor pointing to another as the responsible party.

Emerging challenges in evaluation

Oversight and evaluation are related but distinct. Oversight is about who supervises AI systems after deployment; evaluation is about how the quality of those systems is assessed. Without rigorous evaluation, oversight has no evidentiary basis, and the way evaluation is currently practiced has weaknesses that become more consequential once a system is operating in the real world.

The first weakness is structural, rooted in how AI systems are built and tested: benchmarks are designed to measure performance on known tasks under controlled conditions, and they cannot fully anticipate the range of contexts, user behaviors, and edge cases that arise in practice.

Data drift and context drift compound this problem over time. Data drift occurs when the statistical properties of the inputs a system receives in production deviate from those it was trained on, which is common as user populations change, language evolves, or the external environment shifts. A credit-scoring model trained on pre-pandemic financial behavior, for example, may produce outputs that are consistent with its training but miscalibrated to post-pandemic economic conditions. Context drift is more subtle: it refers to changes in the meaning or relevance of a system’s outputs as the social, institutional, or regulatory context in which it operates changes. A content moderation system trained in one cultural context, for example, may perform poorly when deployed in another. Neither failure is necessarily visible in standard performance metrics, and neither would be caught by a one-time pre-deployment evaluation.
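As a concrete illustration of the data-drift half of this problem, the sketch below compares the training-time distribution of a single numeric feature with recent production inputs using a two-sample Kolmogorov–Smirnov test. The feature, the threshold, and the synthetic data are assumptions for illustration; real monitoring would cover many features and choose tests appropriate to the data, and no statistical test can detect context drift of the kind described above.

```python
# Minimal data-drift check (illustrative): compare the training distribution of
# one feature against recent production inputs with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp


def drift_detected(train_values: np.ndarray, live_values: np.ndarray,
                   p_threshold: float = 0.01) -> bool:
    """Flag drift when the live distribution differs significantly from training."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < p_threshold


# Synthetic example: the income distribution shifts after deployment.
rng = np.random.default_rng(0)
train_income = rng.normal(50_000, 10_000, size=5_000)   # pre-deployment data
live_income = rng.normal(58_000, 12_000, size=1_000)    # recent production data
print(drift_detected(train_income, live_income))         # True: the inputs have shifted
```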

For generative AI systems, current benchmarks are particularly limited. They measure performance on discrete tasks rather than on the kinds of open-ended interactions that characterize real-world use. They also struggle to assess the properties that matter most for governance purposes, such as fairness across demographic groups, reliability in high-stakes scenarios, and the tendency to generate plausible but false information. The NIST Generative AI Profile identifies hallucination, data privacy, and harmful bias as among the most significant risks of generative AI systems, and notes that these risks are difficult to assess comprehensively through pre-deployment testing alone.[7]

General-purpose AI systems pose a further evaluation challenge. Each deployment context has its own risk profile, its own relevant standards, and its own affected population of users. The EU AI Act’s provisions on general-purpose AI models acknowledge this complexity,[8] but they do not clarify how to evaluate a system whose use cases are open-ended. Evaluation frameworks designed for narrow, task-specific AI systems do not translate straightforwardly to systems that can be repurposed across applications.

Most importantly, there is no internationally agreed standard for the post-deployment evaluation of AI systems. ISO/IEC 42001, the international standard for AI management systems, provides a governance framework that includes requirements for monitoring, measurement, and continual improvement,[9] but it does not specify the technical methods by which post-deployment performance should be assessed. The NIST AI RMF provides guidance on measurement and evaluation as part of the broader risk management lifecycle, but its recommendations are voluntary and high-level.[10] As a result, organizations are largely left to design their own post-deployment evaluation approaches, with wide variation in rigor, scope, and methodology.
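In the absence of a standard, such home-grown approaches often amount to something like the following sketch: periodically re-scoring a labelled audit sample drawn from production and checking per-group accuracy gaps, one of the fairness properties discussed above. The record format, group labels, and review threshold are assumptions for illustration, not a recognized methodology.

```python
# Minimal recurring evaluation (illustrative): overall accuracy and per-group
# accuracy gaps on a labelled audit sample drawn from production traffic.
from collections import defaultdict


def evaluate(records, gap_threshold=0.05):
    """records: iterable of (group, prediction, label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, pred, label in records:
        total[group] += 1
        correct[group] += int(pred == label)
    accuracy = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / sum(total.values())
    worst_gap = max(accuracy.values()) - min(accuracy.values())
    return {
        "overall": overall,
        "by_group": accuracy,
        "flag_for_review": worst_gap > gap_threshold,  # escalate large gaps to a human
    }


# Illustrative audit sample: accuracy differs sharply between two groups.
sample = [("A", 1, 1), ("A", 0, 0), ("A", 1, 1),
          ("B", 1, 0), ("B", 0, 0), ("B", 1, 0)]
print(evaluate(sample))
```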

Emerging challenges in accountability

Accountability, in the context of AI governance, refers to the capacity to identify who is responsible when an AI system causes harm, from whom affected parties can seek redress, and how to ensure that responsible parties face meaningful consequences. It is the most politically and legally contested dimension of post-deployment governance.

When a traditional software system produces a flawed output, the error can usually be traced to a specific design decision or coding mistake. AI systems based on machine learning do not work this way. Their outputs emerge from the interaction of training data, model architecture, and deployment context in ways that are opaque even to their developers. When an LLM generates a harmful or false output, or when a predictive policing tool produces a racially biased recommendation, tracing the chain from design to outcome is nearly impossible. This structural feature makes traditional accountability frameworks, which depend on identifying a specific act of negligence or a specific design defect, difficult to apply.

Incident reporting processes remain weak in most jurisdictions. The EU AI Act requires providers of high-risk AI systems to report serious incidents to national market surveillance authorities within defined timeframes: fifteen days in most cases, two days for widespread infringements, and ten days in the event of a death.[11] These are meaningful requirements, but they apply only to high-risk systems as defined by the Act, and they depend on providers having the technical capacity to detect incidents and the commitment to report them.
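For providers that do attempt to operationalize these requirements, the reporting windows described above can be encoded directly in an incident-handling workflow, as in the sketch below. The category names are our own shorthand rather than terms defined in the Act, and the sketch is an illustration of the deadlines, not legal advice.

```python
# Minimal sketch (illustrative shorthand, not the Act's own terminology):
# compute the latest reporting deadline for a serious incident from the time
# the provider becomes aware of it.
from datetime import datetime, timedelta

REPORTING_WINDOWS = {
    "serious_incident": timedelta(days=15),        # default case
    "widespread_infringement": timedelta(days=2),  # widespread infringement
    "death": timedelta(days=10),                   # incident involving a death
}


def reporting_deadline(category: str, awareness_time: datetime) -> datetime:
    """Latest time by which the incident must be reported to the authority."""
    return awareness_time + REPORTING_WINDOWS[category]


print(reporting_deadline("widespread_infringement", datetime(2026, 4, 26)))
```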

Outside the EU, mandatory AI incident reporting requirements are rare. The OECD’s work on a common reporting framework for AI incidents, published in February 2025, represents an important step toward international coordination, but it is a policy paper and is not legally binding.[12] As a result, global data on AI failures in deployment remains incomplete and underreported, which makes it difficult to aggregate in ways that would support systematic learning.

Accountability is further weakened by limited transparency into the behavior of deployed AI systems. Affected individuals often have no way of knowing that an AI system was involved in a decision that affected them, let alone how that decision was reached. ISO/IEC 42001 requires organizations to maintain records of AI system performance and to document their governance processes, but adherence to the standard is voluntary and does not guarantee meaningful transparency to external parties.[13] The EU AI Act’s transparency obligations for high-risk systems, including requirements to provide information and to maintain logs of system operation, are stricter, but their practical implementation is still being worked out as the Act’s provisions come into force.

The policy concern that accountability requirements might stifle innovation is legitimate, but it is sometimes exaggerated. Accountability requirements can create incentives for more careful system design, more rigorous testing, and more responsible deployment practices. The risk that the public loses trust in AI systems and in the institutions that govern them is more serious than the risk of slowing innovation.

Conclusion

There is growing recognition of the importance of post-deployment monitoring, reflected in the EU AI Act, the NIST AI RMF, the OECD’s incident reporting work, and the ISO/IEC 42001 management system standard. That recognition responds to a basic fact: the risks associated with AI systems evolve and interact with social and institutional environments in unexpected ways. The three dimensions examined in this insight are interconnected aspects of a single governance challenge. Effective post-deployment governance requires all three to function together, and the current state of each is incomplete.

Several recommendations follow from this analysis. First, regulatory frameworks need to move beyond checkpoint compliance toward continuous supervision. Second, standardized post-deployment evaluation methods need to be developed urgently. Third, incident reporting needs to be strengthened and extended. The divergence of the global regulatory landscape, with the EU pursuing binding obligations and the United States moving toward a more sector-specific and voluntary approach, makes international coordination difficult to achieve. But with AI systems being deployed at scale in healthcare, finance, public administration, and criminal justice, the effort to tackle these challenges is warranted. Post-deployment monitoring is the principal means by which institutions will determine, over time, whether AI systems remain safe, fair, and lawful.


[1] European Parliament and Council of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). (2024) retrieved April 3, 2026, from eur-lex.europa.eu

[2] National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile (NIST AI 600-1). (2024) retrieved April 7, 2026, from nist.gov

[3] OECD. Defining AI Incidents and Related Terms. (2024) retrieved April 7, 2026, from oecd.org

[4] European Parliament and Council of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). (2024) retrieved April 3, 2026, from eur-lex.europa.eu

[5] Ibid.

[6] Gradient Flow. 2025 AI Governance Survey. (2025) retrieved April 10, 2026, from gradientflow.com

[7] National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile (NIST AI 600-1). (2024) retrieved April 11, 2026, from nist.gov

[8] European Parliament and Council of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). (2024) retrieved April 3, 2026, from eur-lex.europa.eu

[9] International Organization for Standardization. ISO/IEC 42001:2023, Information technology, Artificial intelligence, Management system. (2023) retrieved April 8, 2026, from iso.org

[10] National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile (NIST AI 600-1). (2024) retrieved April 11, 2026, from nist.gov

[11] European Parliament and Council of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). (2024) retrieved April 3, 2026, from eur-lex.europa.eu

[12] OECD. Defining AI Incidents and Related Terms. (2024) retrieved April 7, 2026, from oecd.org

[13] International Organization for Standardization. ISO/IEC 42001:2023, Information technology, Artificial intelligence, Management system. (2023) retrieved April 8, 2026, from iso.org
