## 1.4.0 (2025-03-27)
### Features Added
- Enhanced binary evaluation results with customizable thresholds (a usage sketch follows this list)
  - Added threshold support for QA and ContentSafety evaluators
  - Evaluation results now include both the score and threshold values
  - Configurable threshold parameter allows custom binary classification boundaries
  - Default thresholds provided for backward compatibility
  - Quality evaluators use "higher is better" scoring (score ≥ threshold is positive)
  - Content safety evaluators use "lower is better" scoring (score ≤ threshold is positive)
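A minimal sketch of the new threshold behavior, assuming a `threshold` keyword on an individual prompt-based quality evaluator; the exact parameter name and the shape of the returned result may differ, so check the evaluator docstrings for your installed version:

```python
# Hedged sketch: the `threshold` keyword and the returned fields are assumptions
# based on the notes above, not a confirmed signature.
from azure.ai.evaluation import CoherenceEvaluator

model_config = {
    "azure_endpoint": "https://<your-endpoint>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

# Quality evaluators are "higher is better": score >= threshold counts as positive.
coherence = CoherenceEvaluator(model_config=model_config, threshold=4)
result = coherence(
    query="What is the capital of France?",
    response="Paris is the capital of France.",
)
print(result)  # expected to include the score, the threshold, and a binary label
```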
- New Built-in evaluator called CodeVulnerabilityEvaluator is added (a usage sketch follows this list).
  - It provides capabilities to identify the following code vulnerabilities:
    - path-injection
    - sql-injection
    - code-injection
    - stack-trace-exposure
    - incomplete-url-substring-sanitization
    - flask-debug
    - clear-text-logging-sensitive-data
    - incomplete-hostname-regexp
    - server-side-unvalidated-url-redirection
    - weak-cryptographic-algorithm
    - full-ssrf
    - bind-socket-all-network-interfaces
    - client-side-unvalidated-url-redirection
    - likely-bugs
    - reflected-xss
    - clear-text-storage-sensitive-data
    - tarslip
    - hardcoded-credentials
    - insecure-randomness
  - It also supports multiple coding languages such as Python, Java, C++, C#, Go, JavaScript, and SQL.
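A minimal sketch of calling the new evaluator; the constructor and call arguments shown here (`azure_ai_project`, `credential`, `query`, `response`) are assumed from the pattern of the other service-based evaluators, so verify them against the evaluator's docstring:

```python
# Hedged sketch of CodeVulnerabilityEvaluator usage; argument names are assumptions.
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import CodeVulnerabilityEvaluator

azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

code_vuln = CodeVulnerabilityEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
)

# query is the code-generation prompt, response is the generated code to inspect.
result = code_vuln(
    query="Write a function that looks a user up by name.",
    response='cursor.execute("SELECT * FROM users WHERE name = \'" + name + "\'")',
)
print(result)  # expected to flag sql-injection among the vulnerability labels
```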
- New Built-in evaluator called UngroundedAttributesEvaluator is added (a usage sketch follows this list).
  - It evaluates ungrounded inference of human attributes for a given query, response, and context for a single-turn evaluation only, where query represents the user query and response represents the AI system response given the provided context.
  - Ungrounded Attributes checks whether a response is ungrounded and, if so, whether it contains information about the protected class or emotional state of a person.
  - It identifies the following attributes:
    - emotional_state
    - protected_class
    - groundedness
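A minimal single-turn sketch; the `query`/`response`/`context` argument names follow the description above and the other service-based evaluators, so treat them as assumptions to verify against the evaluator's docstring:

```python
# Hedged sketch of UngroundedAttributesEvaluator usage; argument names are assumptions.
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import UngroundedAttributesEvaluator

ungrounded_attributes = UngroundedAttributesEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project={
        "subscription_id": "<subscription-id>",
        "resource_group_name": "<resource-group>",
        "project_name": "<project-name>",
    },
)

result = ungrounded_attributes(
    query="How is the employee doing today?",
    response="They seem deeply unhappy with their job.",
    context="The employee joined the standup and gave a status update.",
)
print(result)  # expected to report emotional_state, protected_class, and groundedness
```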
- New Built-in evaluators for Agent Evaluation (Preview); a usage sketch follows this list.
  - IntentResolutionEvaluator - Evaluates the intent resolution of an agent's response to a user query.
  - ResponseCompletenessEvaluator - Evaluates the response completeness of an agent's response to a user query.
  - TaskAdherenceEvaluator - Evaluates the task adherence of an agent's response to a user query.
  - ToolCallAccuracyEvaluator - Evaluates the accuracy of tool calls made by an agent in response to a user query.
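A minimal sketch of one of the preview agent evaluators; it assumes the prompt-based agent evaluators take a `model_config` and a simple `query`/`response` pair, which should be verified against the preview documentation:

```python
# Hedged sketch of IntentResolutionEvaluator usage; the call signature is an assumption.
from azure.ai.evaluation import IntentResolutionEvaluator

model_config = {
    "azure_endpoint": "https://<your-endpoint>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

intent_resolution = IntentResolutionEvaluator(model_config=model_config)
result = intent_resolution(
    query="Cancel my 3pm meeting and email the attendees.",
    response="I cancelled your 3pm meeting and sent a note to all attendees.",
)
print(result)
```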
### Bugs Fixed
- Fixed error in `GroundednessProEvaluator` when handling non-numeric values like "n/a" returned from the service.
- Uploading local evaluation results from `evaluate` with the same run name will no longer result in each online run sharing (and bashing) result files.