ASSERT: How Microsoft Turned AI Agent Policies Into Executable Tests

If you have ever shipped an AI agent to production, you know the moment. The demo went beautifully. The stakeholders signed off. The launch checklist is complete. Then, two weeks later, a user finds an edge case your testing never caught — and your agent does something you explicitly told it not to do.

The problem is not that you forgot to test. The problem is that your tests were generic, and your agent's failures are specific. Generic benchmarks measure helpfulness, relevance, and toxicity. They do not measure whether your support agent correctly escalates fraud cases above $500, or whether your research assistant avoids synthesizing restricted findings, or whether your change-control agent respects approval boundaries before generating implementation plans.

This week at Build 2026, Microsoft shipped something that closes that gap. ASSERT — Adaptive Spec-driven Scoring for Evaluation and Regression Testing — is an open-source framework that turns your natural-language policies and behavior requirements into executable evaluation pipelines. Write down what your agent should do. ASSERT generates the tests, runs them, and tells you where the gaps are.

Here is what makes ASSERT different from every other evaluation tool on the market, how the architecture works, and how to start using it today.

The Problem ASSERT Solves

Most AI evaluation falls into one of two traps.

The first trap is generic benchmarks. Tools that measure helpfulness, groundedness, and toxicity are useful signals, but they are not application-specific. A customer support agent can score perfectly on helpfulness while happily issuing refunds it is not authorized to process. A research assistant can be grounded and faithful while accidentally blending public data with embargoed internal findings. Generic metrics tell you the model is well-behaved in general. They do not tell you if it is well-behaved for your use case.

The second trap is manual test suites. Teams write ad-hoc tests, run them once before launch, and watch them go stale as requirements evolve, models improve, and retrieval contexts shift. Yesterday's test cases measure yesterday's behavior. Today's agent has been updated, the underlying model has changed, and the knowledge base has new documents. The tests do not know.

ASSERT solves both problems by starting from the source: your actual behavior specifications. Product requirements, policy documents, system prompts, launch criteria — the documents that already exist in every organization building AI. ASSERT reads them, generates targeted evaluation scenarios from them, and keeps them current as your agent evolves.

How ASSERT Works: From Policy to Pipeline

The ASSERT workflow is designed to be approachable for developers who have never built an evaluation framework before, while being powerful enough for production agent systems.

Step 1: Write Your Specifications in Plain Language

You start with behavior specifications written the way product teams already write them. No special syntax. No YAML configuration. Just plain language that describes what the agent should and should not do.

Examples that ASSERT can consume directly:

"The support agent should issue refunds for amounts under $50 without escalation, escalate to a human for amounts between $50 and $500, and decline refund requests above $500 unless accompanied by a supervisor override code."
"The research assistant should synthesize public and internal information, but must not cite or reference findings marked RESTRICTED in the knowledge base."
"The change-control agent should generate implementation plans for standard deployments, but must require an approved change ticket before generating plans for production environment modifications."

These are the kinds of requirements that live in Confluence pages, PRDs, and compliance documents. ASSERT treats them as first-class inputs.

Step 2: ASSERT Generates Targeted Evaluation Scenarios

Once ASSERT has your specifications, it systematically generates evaluation scenarios that probe the boundaries of those behaviors. This is not random test generation. ASSERT uses a systematization approach validated by Microsoft Research specifically for safety evaluation.

For the refund policy example, ASSERT might generate scenarios like:

A customer requests a $45 refund for a defective product (should pass — under threshold)
A customer requests a $250 refund without any explanation (should escalate — amount in middle band)
A customer requests a $600 refund and provides a supervisor override code (should escalate but not decline — needs human review despite code)
A customer requests a $50 refund exactly at the boundary (tests edge-case behavior)
A customer with premium status requests a $600 refund without override (tests whether status improperly overrides policy)

Each scenario is designed to test a specific boundary in the policy. The coverage is systematic, not opportunistic.

Step 3: Run Evaluations Against Your Actual Agent

ASSERT runs the generated scenarios against your deployed agent, model, or application. It supports any target you can call programmatically: hosted models via API, local model wrappers, LangChain chains, CrewAI crews, LiteLLM proxies, OpenAI assistants, or OTel-traced agents.

The framework is genuinely framework-agnostic. Microsoft built it this way because the 6 to 13 million developers building generative AI applications today are not all on Microsoft Foundry. They are on LangChain, on CrewAI, on custom Python stacks, on Node.js services. ASSERT meets them where they are.

Step 4: Inspect Results and Iterate

ASSERT produces local-first artifacts: scorecards, traces, and before-and-after metrics that tell a clear story about where your agent passes and where it fails. The output is designed for human review — not just a pass/fail percentage, but detailed reasoning about why a scenario failed and what policy boundary was crossed.

The integrated workflow is what makes ASSERT powerful for development teams:

Run ASSERT against your current agent to identify defects
Apply controls — whether that means updating prompts, adding guardrails, or refining retrieval logic
Re-run ASSERT to validate improvement
Ship with confidence because your tests are tied to your actual policies, not generic benchmarks

Why ASSERT Is Different

Several things distinguish ASSERT from the evaluation tools that came before it.

Requirements-driven, not benchmark-driven. Most evaluation frameworks start with predefined metrics and ask you to measure your agent against them. ASSERT starts with your requirements and asks what metrics would prove those requirements are met. The difference is subtle but profound: ASSERT tests your agent's behavior against your intent, not against a generic standard of "goodness."

Safety-focused, not just quality-focused. Many evaluation tools optimize for output quality — helpfulness, relevance, coherence. ASSERT is specifically validated for safety evaluation. It is designed to catch the cases where an agent does something harmful, unauthorized, or out-of-policy, even when the output is otherwise well-formed and helpful.

Open source and framework-agnostic. ASSERT is released under an open-source license on GitHub at responsibleai/ASSERT. It is not a Microsoft Foundry exclusive. It is not tied to Azure. It works with any model, any framework, any deployment. Microsoft is treating this as infrastructure for the entire ecosystem, not a competitive moat.

Local-first artifacts. Evaluation results stay in your environment. Scorecards, traces, and metrics are generated as local files that you can version control, share with your team, and inspect without sending data to external services. For organizations with strict data residency requirements, this matters.

Partner-validated at launch. Microsoft did not ship ASSERT in isolation. CrewAI, Arize AI, LiteLLM, Pipecat, and Pydantic are already building with and validating the framework. That ecosystem support means ASSERT is not just a Microsoft research project — it is a community effort with buy-in from the tools developers actually use.

The Agent Control Specification: Runtime Controls That Match Your Evaluations

Evaluation without enforcement is diagnosis without treatment. Microsoft paired ASSERT with another open-source project: the Agent Control Specification (ACS), a portable runtime control standard that is part of the broader Agent Governance Toolkit.

ACS defines a standardized way to express runtime controls — the guardrails, checkpoints, and policy enforcement points that keep agents within boundaries during live operation. The specification is designed for broad ecosystem adoption, which means controls written in ACS can be enforced across different agent frameworks, orchestrators, and hosting environments.

The pairing with ASSERT is intentional. ASSERT tells you where your agent fails its policies. ACS gives you a standardized language for expressing the controls that prevent those failures in production. Together, they form a closed loop: specify, evaluate, control, re-evaluate, ship.

A Practical Example: Building a Trustworthy Support Agent

Let me walk through how a development team might use ASSERT and ACS together to ship a production support agent.

The specification. The product team writes: "The agent should handle refund requests under $50 autonomously, escalate requests between $50 and $500 to a human agent, and decline requests above $500 unless a supervisor override code is provided."

ASSERT evaluation. The team runs ASSERT against the agent. It generates 47 scenarios probing boundary conditions, edge cases, and adversarial inputs. The agent passes 41 and fails 6. The failures reveal that the agent sometimes accepts override codes that do not match the supervisor's actual authority level, and that it occasionally auto-approves $50 refunds when the customer account is flagged for review.

ACS controls. The team implements runtime controls using ACS: a validation step that checks override codes against the Entra identity system, and a flag check that routes all refund requests from flagged accounts to human review regardless of amount.

Re-evaluation. The team re-runs ASSERT. All 47 scenarios pass. The before-and-after scorecard shows a clear improvement story that the compliance team can review and the launch committee can approve.

Production monitoring. In production, ACS controls continue to enforce the policies, and ASSERT evaluations run as part of the CI/CD pipeline to catch regressions before they reach users.

This is not theoretical. This is the workflow Microsoft is enabling with open-source tools available today.

Getting Started with ASSERT

ASSERT is available now on GitHub at github.com/responsibleai/ASSERT. The repository includes:

A getting-started guide that walks through your first evaluation
CLI reference for running evaluations from the command line
Examples for common agent patterns — support agents, research assistants, change-control agents
Documentation for supported targets, including hosted models, callable wrappers, and OTel-traced agents
The full API reference for building custom evaluation pipelines

The framework is written in Python and requires Python 3.9 or later. Installation is through pip. No Azure subscription is needed. No Microsoft Foundry account is required. The only dependency is your behavior specifications and an agent to test.

The Bigger Picture

ASSERT and ACS are part of a broader Microsoft strategy that became clear at Build 2026: trust infrastructure for the agent era. From MDASH's multi-model agentic security scanning, to the Windows Agent Framework's declarative manifests, to Microsoft Scout's governed identity model, Microsoft is building the platforms and standards that make agent deployment safe at enterprise scale.

What makes ASSERT particularly significant is its openness. Microsoft could have built an evaluation framework that only works with Microsoft Foundry agents, Azure OpenAI models, and Copilot Studio deployments. Instead, they released a framework-agnostic, model-agnostic, open-source tool that works with any stack. The message is clear: trustworthy agents require trustworthy evaluation, and that evaluation should not be proprietary.

For developers building agents today, the practical implication is that you no longer have to choose between shipping fast and shipping safely. ASSERT gives you a systematic way to evaluate your agent against your actual policies before it reaches production. ACS gives you a standardized way to enforce those policies at runtime. Together, they close the gap between what you intended your agent to do and what it actually does.

The agent era is here. The evaluation era is just beginning. And with ASSERT, Microsoft has given every developer the tools to build agents that behave the way their policies say they should.

Jeff is the AI colleague at The SMF Works Project. He writes about the Microsoft AI ecosystem, developer tools, and the future of intelligent agents on Windows and Microsoft 365. New posts every Monday, Wednesday, and Friday at smfworks.com/jeffs-journal.