Azure/azure-sdk-for-python azure-ai-evaluation

1.16.7 (2026-05-07)

Features Added

Added extra_headers keyword argument to RaiServiceEvaluatorBase (and all content safety evaluators) to allow passing custom HTTP headers to all backend RAI service calls. SDK-owned headers (Authorization, User-Agent, Content-Type, aml-user-token, x-ms-client-request-id) cannot be overridden by extra_headers.
Added status field ("completed", "error", "skipped") on evaluation result items to indicate evaluator execution outcome.
Added skipped and errored counts to result_counts and per_testing_criteria_results in AOAI evaluation summaries.
Added skipped to ResultCount and skipped/errored to PerTestingCriteriaResult typed contracts.
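
The extra_headers protection described in the first item above can be illustrated with a minimal sketch. The function name merge_headers and the constant PROTECTED_HEADERS are hypothetical and are not part of the SDK. The case-insensitive comparison and the choice to silently drop protected entries are also assumptions; the actual implementation may warn or behave differently.

```python
from typing import Dict, Mapping, Optional

# Header names the release note lists as SDK-owned. Comparing them in
# lowercase is an assumption made for this sketch.
PROTECTED_HEADERS = frozenset(
    name.lower()
    for name in (
        "Authorization",
        "User-Agent",
        "Content-Type",
        "aml-user-token",
        "x-ms-client-request-id",
    )
)


def merge_headers(
    sdk_headers: Mapping[str, str],
    extra_headers: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Hypothetical helper: add caller headers without overriding SDK-owned ones."""
    merged = dict(sdk_headers)
    for name, value in (extra_headers or {}).items():
        if name.lower() in PROTECTED_HEADERS:
            # Caller attempted to override an SDK-owned header; ignore it.
            continue
        merged[name] = value
    return merged
```

With this sketch, a caller-supplied x-custom header would be forwarded, while a caller-supplied authorization header would leave the SDK's own Authorization value untouched.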

Bugs Fixed

_TaskNavigationEfficiencyEvaluator now accepts JSON-stringified response and ground_truth inputs (e.g., from data pipelines that serialize list/tuple inputs to strings). String inputs are parsed as JSON; on parse failure the original value is preserved so downstream validation surfaces the error as before.
Fixed error blame attribution in _get_single_run_results to perform a case-insensitive comparison when checking the AOAI error code for UserError, ensuring failed evaluation runs are correctly classified as user errors regardless of server-side casing.
Fixed deflection_rate evaluator labeling every result "pass" regardless of the actual score. The inverse metric adjustment was overriding the evaluator's correct string labels.
Fixed evaluate() raising EvaluationException: (InternalError) unhashable type: 'list' when an evaluator emitted a list value under a _result-suffixed column. Binary aggregation now skips such columns with a warning instead of aborting the entire run.
Fixed task_adherence red team scoring by adding scenario=redteam to the RAI scorer evaluation payload, ensuring the server-side score mapping correctly routes to Direct mapping for attack success determination.
Fixed row classification double-counting in _calculate_aoai_evaluation_summary where errored rows were counted separately and could also be counted as passed/failed. Rows are now classified into mutually exclusive buckets with priority: passed > failed > errored > skipped.
Fixed row classification where rows with empty or missing results lists were incorrectly counted as "passed" (the condition passed_count == len(results) - error_count evaluated 0 == 0 as True).
Fixed _get_metric_result prefix matching where shorter metric names (e.g., xpia) could match before longer, more-specific ones (e.g., xpia_manipulated_content). Metric names are now sorted by descending length so that the longest prefix matches first.
Fixed non-dict _properties values from evaluators causing downstream issues. Values that are not dicts are now logged and dropped gracefully.
Fixed filename length error in _inline_image by catching OSError/ValueError during local path resolution and falling back to returning a text chunk instead of raising.
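
The longest-prefix matching fix noted above for _get_metric_result can be sketched as follows. The function match_metric and its arguments are hypothetical and do not reflect the SDK's internal signature; the sketch only shows why sorting candidates by descending length prevents a short name from shadowing a more specific one.

```python
from typing import Iterable, Optional


def match_metric(column: str, metric_names: Iterable[str]) -> Optional[str]:
    """Hypothetical helper: return the longest metric name that prefixes `column`."""
    # Try longer names first so "xpia_manipulated_content" wins over "xpia".
    for name in sorted(metric_names, key=len, reverse=True):
        if column.startswith(name):
            return name
    return None
```

Without the sort, iterating over ["xpia", "xpia_manipulated_content"] in that order would match xpia for a column that starts with xpia_manipulated_content, which is the behaviour the fix addresses.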

Other Changes

Moved token usage attributes (gen_ai.evaluation.usage.input_tokens, gen_ai.evaluation.usage.output_tokens) from standard App Insights event attributes into the internal_properties JSON bag to align with internal telemetry conventions.
