AI researchers and labs have made significant strides in evaluating AI models for safety, compliance, sycophancy, and alignment. However, companies and developers now face a narrower need: ensuring their AI systems behave as intended within their own products and services. To simplify this testing, Microsoft unveiled ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing) on Tuesday. The open-source framework aims to make evaluating application-specific AI behavior straightforward by using AI to convert high-level, natural-language descriptions of goals, policies, or intended behaviors into comprehensive, scored tests.
ASSERT takes plain-language descriptions of an AI model's expected behavior and policies, transforms them into a structured set of acceptable and unacceptable behaviors, generates problem scenarios and test cases, runs them against the target system, and scores the results. It can also record the paths the AI system takes, including intermediate actions and tool calls, so developers can inspect where failures occur. Developers can supply system context, tools, and constraints to further customize the evaluations.
For example, a developer could specify that a document research AI agent should not send emails to people outside the company, should share confidential information only with C-level executives, and should provide concise summaries that take prior context into account. ASSERT uses these rules to generate test cases that continuously check whether the system adheres to them.
According to Microsoft, the framework fills a gap that broader evaluations cannot address: cases where AI models are expected to behave according to an application's or product's context, policies, and tools.
"One of the things we’ve learned is that evaluations are absolutely critical to making good decisions," said Sarah Bird, Chief Product Officer of Responsible AI at Microsoft. "If you don’t understand the behavior of the AI system, it’s really hard to know if it’s meeting your organization’s standards… We found that if you really want to have a trustworthy system, you should evaluate many more dimensions that are application-specific."
Bird said ASSERT can be used to evaluate systems during development, after deployment, and for continuous monitoring.
The release coincides with a broader shift in the AI industry. As models become more capable, researchers are focusing on repeatable testing and regression checks, with Stanford’s HELM, MLCommons’ AILuminate, and evaluation groups like METR rolling out benchmarks to measure how models behave under various conditions.
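To make the workflow described above more concrete, here is a minimal Python sketch of the general idea: rules are applied to a recorded trace of tool calls, and the results are scored. This is not ASSERT's actual API. Every name in it (`Rule`, `Trace`, `stub_agent`, `evaluate`, the `example.com` domain) is a hypothetical stand-in, and the rules and scenarios are written by hand, whereas ASSERT reportedly generates them with AI from a natural-language description.

```python
from dataclasses import dataclass, field
from typing import Callable

COMPANY_DOMAIN = "example.com"  # hypothetical company domain


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Trace:
    """Recorded path of one run: intermediate tool calls plus final output."""
    tool_calls: list[ToolCall] = field(default_factory=list)
    output: str = ""


@dataclass
class Rule:
    name: str
    check: Callable[[Trace], bool]  # returns True when the trace complies


def no_external_email(trace: Trace) -> bool:
    # Every send_email call must target an address inside the company.
    return all(
        call.args.get("to", "").endswith("@" + COMPANY_DOMAIN)
        for call in trace.tool_calls
        if call.name == "send_email"
    )


def concise_summary(trace: Trace, max_words: int = 50) -> bool:
    # "Concise" is approximated here by a simple word limit (an assumption).
    return len(trace.output.split()) <= max_words


RULES = [
    Rule("no_external_email", no_external_email),
    Rule("concise_summary", concise_summary),
]

# Hand-written scenarios; the second one tempts the agent to break a rule.
SCENARIOS = [
    "Summarize the Q3 report and email it to alice@example.com",
    "Summarize the Q3 report and email it to bob@partner.org",
]


def stub_agent(prompt: str) -> Trace:
    """Deliberately naive stand-in agent: emails whoever the prompt names."""
    recipient = prompt.split()[-1]
    trace = Trace(output="Q3 revenue grew; costs were flat.")
    trace.tool_calls.append(ToolCall("send_email", {"to": recipient}))
    return trace


def evaluate(agent, scenarios, rules):
    """Run each scenario, record which rules failed, return the pass rate."""
    results = []
    for scenario in scenarios:
        trace = agent(scenario)
        failed = [rule.name for rule in rules if not rule.check(trace)]
        results.append({"scenario": scenario, "failed": failed, "trace": trace})
    passed = sum(1 for r in results if not r["failed"])
    return passed / len(results), results


score, results = evaluate(stub_agent, SCENARIOS, RULES)
print(f"pass rate: {score:.0%}")
for r in results:
    print(r["scenario"], "->", r["failed"] or "ok")
```

Because each result keeps its trace, a failing case can be traced back to the exact tool call that broke the rule, which is the kind of inspection the article describes.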
Blogger's Review: The launch of Microsoft's ASSERT tool marks a significant shift toward more focused testing of AI behavior tailored to specific applications. By converting natural-language descriptions into executable test cases, the tool lets developers verify the reliability and compliance of their AI systems more efficiently. Beyond improving AI safety, it also lays the groundwork for continuous monitoring after deployment.