1.16.7 (2026-05-07)
Features Added
- Added `extra_headers` keyword argument to `RaiServiceEvaluatorBase` (and all content safety evaluators) to allow passing custom HTTP headers to all backend RAI service calls. SDK-owned headers (`Authorization`, `User-Agent`, `Content-Type`, `aml-user-token`, `x-ms-client-request-id`) cannot be overridden by `extra_headers`.
- Added `status` field (`"completed"`, `"error"`, `"skipped"`) on evaluation result items to indicate evaluator execution outcome.
- Added `skipped` and `errored` counts to `result_counts` and `per_testing_criteria_results` in AOAI evaluation summaries.
- Added `skipped` to `ResultCount` and `skipped`/`errored` to `PerTestingCriteriaResult` typed contracts.
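
The header protection described for `extra_headers` can be pictured with the sketch below. It is not the SDK's implementation: `merge_headers` and `PROTECTED_HEADERS` are hypothetical names, and silently ignoring a protected header (rather than raising) is an assumption. The case-insensitive comparison is also an assumption, chosen because HTTP header names are case-insensitive.

```python
from typing import Dict, Mapping, Optional

# Header names the changelog lists as SDK-owned, lowercased for comparison.
PROTECTED_HEADERS = frozenset(
    name.lower()
    for name in (
        "Authorization",
        "User-Agent",
        "Content-Type",
        "aml-user-token",
        "x-ms-client-request-id",
    )
)


def merge_headers(
    sdk_headers: Mapping[str, str],
    extra_headers: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Hypothetical helper: add caller headers without overriding SDK-owned ones."""
    merged = dict(sdk_headers)
    for name, value in (extra_headers or {}).items():
        if name.lower() in PROTECTED_HEADERS:
            # Assumption: a caller-supplied value for an SDK-owned header is ignored.
            continue
        merged[name] = value
    return merged
```

With this sketch, passing `{"authorization": "...", "x-custom-trace": "abc"}` as `extra_headers` would add only `x-custom-trace` and leave the SDK's `Authorization` value in place.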
Bugs Fixed
- `_TaskNavigationEfficiencyEvaluator` now accepts JSON-stringified `response` and `ground_truth` inputs (e.g., from data pipelines that serialize list/tuple inputs to strings). String inputs are parsed as JSON; on parse failure the original value is preserved so downstream validation surfaces the error as before.
- Fixed error blame attribution in `_get_single_run_results` to perform a case-insensitive comparison when checking the AOAI error code for `UserError`, ensuring failed evaluation runs are correctly classified as user errors regardless of server-side casing.
- Fixed `deflection_rate` evaluator showing incorrect pass/fail labels where all results were labeled "pass" regardless of the actual score. The inverse metric adjustment was overriding the evaluator's correct string labels, remapping every result to "pass".
- Fixed `evaluate()` raising `EvaluationException: (InternalError) unhashable type: 'list'` when an evaluator emitted a list value under a `_result`-suffixed column. Binary aggregation now skips such columns with a warning instead of aborting the entire run.
- Fixed `task_adherence` red team scoring by adding `scenario=redteam` to the RAI scorer evaluation payload, ensuring the server-side score mapping correctly routes to Direct mapping for attack success determination.
- Fixed row classification double-counting in `_calculate_aoai_evaluation_summary` where errored rows were counted separately and could also be counted as passed/failed. Rows are now classified into mutually exclusive buckets with priority: passed > failed > errored > skipped.
- Fixed row classification where rows with empty or missing results lists were incorrectly counted as "passed" (the condition `passed_count == len(results) - error_count` evaluated `0 == 0` as `True`).
- Fixed `_get_metric_result` prefix matching where shorter metric names (e.g., `xpia`) could match before longer, more-specific ones (e.g., `xpia_manipulated_content`). Candidate names are now sorted by length, descending, for correct longest-prefix matching.
- Fixed non-dict `_properties` values from evaluators causing downstream issues. Values that are not dicts are now logged and dropped gracefully.
- Fixed filename length error in `_inline_image` by catching `OSError`/`ValueError` during local path resolution and falling back to returning a text chunk instead of throwing.
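
The longest-prefix fix for `_get_metric_result` can be illustrated with a small sketch. The function name `match_metric_prefix` and the example column names are hypothetical; only the idea of sorting candidates by length, descending, comes from the entry above.

```python
from typing import Iterable, Optional


def match_metric_prefix(column: str, metric_names: Iterable[str]) -> Optional[str]:
    """Return the longest metric name that is a prefix of `column`, or None."""
    # Trying longer names first keeps a short name such as "xpia" from
    # shadowing a more specific one such as "xpia_manipulated_content".
    for name in sorted(metric_names, key=len, reverse=True):
        if column.startswith(name):
            return name
    return None


metrics = ["xpia", "xpia_manipulated_content"]
specific = match_metric_prefix("xpia_manipulated_content_score", metrics)
general = match_metric_prefix("xpia_score", metrics)
```

Here `specific` resolves to `xpia_manipulated_content` and `general` to `xpia`. Iterating the names in their original order would have returned `xpia` for both columns.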
Other Changes
- Moved token usage attributes (`gen_ai.evaluation.usage.input_tokens`, `gen_ai.evaluation.usage.output_tokens`) from standard App Insights event attributes into the `internal_properties` JSON bag to align with internal telemetry conventions.
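
As a rough picture of the resulting event shape, the sketch below builds an attribute dictionary in which the two token usage values live inside `internal_properties`. The helper `build_event_attributes` is hypothetical, `gen_ai.evaluation.name` is used only as an illustrative top-level attribute, and serializing the bag as a JSON string is an assumption.

```python
import json
from typing import Any, Dict


def build_event_attributes(
    metric_name: str, input_tokens: int, output_tokens: int
) -> Dict[str, Any]:
    """Hypothetical helper: place token usage inside the internal_properties bag."""
    internal_properties = {
        "gen_ai.evaluation.usage.input_tokens": input_tokens,
        "gen_ai.evaluation.usage.output_tokens": output_tokens,
    }
    return {
        # Illustrative top-level attribute; the real event carries its own set.
        "gen_ai.evaluation.name": metric_name,
        # Assumption: the bag is stored as a serialized JSON string.
        "internal_properties": json.dumps(internal_properties),
    }


attrs = build_event_attributes("coherence", input_tokens=120, output_tokens=45)
```

Consumers that previously read the token counts as top-level attributes would, under this assumption, need to parse `internal_properties` first.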