New Features
- AI Guard: Adds SDS (Sensitive Data Scanner) findings to AI Guard spans, enabling visibility into sensitive data detected in LLM inputs and outputs.
- LLM Observability: Adds support for DeepEval evaluations in LLM Observability Experiments. Users can pass a DeepEval evaluation (one that inherits from either BaseMetric or BaseConversationalMetric) as an evaluator in an LLM Observability Experiment. Example:

      from deepeval.metrics import GEval
      from deepeval.test_case import LLMTestCaseParams

      from ddtrace.llmobs import LLMObs

      correctness_metric = GEval(
          name="Correctness",
          criteria="Determine whether the actual output is factually correct based on the expected output.",
          evaluation_steps=[
              "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
              "You should also heavily penalize omission of detail",
              "Vague language, or contradicting OPINIONS, are OK"
          ],
          evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
          async_mode=True
      )

      dataset = LLMObs.create_dataset(
          dataset_name="<DATASET_NAME>",
          description="<DATASET_DESCRIPTION>",
          records=[RECORD_1, RECORD_2, RECORD_3, ...]
      )

      def my_task(input_data, config):
          return input_data["output"]

      def my_summary_evaluator(inputs, outputs, expected_outputs, evaluators_results):
          return evaluators_results["Correctness"].count(True)

      experiment = LLMObs.experiment(
          name="<EXPERIMENT_NAME>",
          task=my_task,
          dataset=dataset,
          evaluators=[correctness_metric],
          summary_evaluators=[my_summary_evaluator],  # optional, used to summarize the experiment results
          description="<EXPERIMENT_DESCRIPTION>."
      )

      result = experiment.run()

- LLM Observability: Adds experiment summary logging after run(), including row count, run count, per-evaluator stats, and error counts.
- LLM Observability: Adds max_retries and retry_delay parameters to experiment.run() for retrying failed tasks and evaluators. Example: experiment.run(max_retries=3, retry_delay=lambda attempt: 2 ** attempt).
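
To make the max_retries and retry_delay entry above concrete, here is a minimal sketch of what a retry loop driven by a delay callable can look like. It is not ddtrace's implementation: run_with_retries, flaky_task, and the injectable sleep argument are hypothetical names used only for illustration, and the sketch assumes the attempt number passed to retry_delay starts at 0, which the release note does not state.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retries(
    fn: Callable[[], T],
    max_retries: int = 0,
    retry_delay: Optional[Callable[[int], float]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying up to max_retries times after a failure.

    Hypothetical helper for illustration only; not part of ddtrace.
    """
    attempt = 0  # assumption: attempts are counted from 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            if retry_delay is not None:
                sleep(retry_delay(attempt))  # wait before the next try
            attempt += 1


# Demo: a task that fails twice, then succeeds.
calls = {"count": 0}
delays = []  # records requested delays instead of actually sleeping


def flaky_task():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = run_with_retries(
    flaky_task,
    max_retries=3,
    retry_delay=lambda attempt: 2 ** attempt,
    sleep=delays.append,
)
print(result, calls["count"], delays)  # ok 3 [1, 2]
```

Under the zero-based assumption, the lambda from the release note yields waits of 1, 2, and 4 seconds for successive retries, which is a plain exponential backoff. Check the ddtrace documentation for the actual attempt numbering and for which exceptions trigger a retry.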
Bug Fixes
- AAP: Fixes a memory corruption issue where concurrent calls to the WAF on the same request context from multiple threads (e.g. an asyncio event loop and a thread pool executor inheriting the same context via contextvars) could cause use-after-free or double-free crashes (SIGSEGV) inside libddwaf. A per-context lock now serializes WAF calls on the same context.
- tracing: Avoid pickling wrappers in ddtrace.internal.wrapping.context.BaseWrappingContext.
- CI Visibility: Fixed an incompatibility with pytest-html and other third-party reporting plugins caused by the ddtrace pytest plugin using a non-standard dd_retry test outcome for retry attempts. The outcome is now set to rerun, which is the standard value used by pytest-rerunfailures and recognized by reporting plugins.
- dynamic instrumentation: Fixes a RuntimeError: generator didn't yield in the Symbol DB remote config subscriber when the process has no writable temporary directory.
- profiling: Fixes a bug that caused certain function names to be displayed as <module> in flame graphs.
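
The per-context lock mentioned in the AAP fix above can be pictured with a small sketch. This is not the ddtrace or libddwaf code: RequestContext and run_waf are hypothetical stand-ins, and the counters exist only to show that calls sharing one context never overlap once they go through that context's lock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class RequestContext:
    """Hypothetical stand-in for a per-request WAF context."""

    def __init__(self):
        self.lock = threading.Lock()  # one lock per context, not a global one
        self.active_calls = 0         # calls currently inside the critical section
        self.max_concurrent = 0       # highest overlap observed
        self.calls = 0                # total completed calls


def run_waf(ctx, payload):
    """Hypothetical stand-in for a native call that is unsafe to run
    concurrently on the same context."""
    with ctx.lock:  # serialize every call that shares this context
        ctx.active_calls += 1
        ctx.max_concurrent = max(ctx.max_concurrent, ctx.active_calls)
        ctx.calls += 1
        ctx.active_calls -= 1
    return len(payload)


# Many threads share a single context, as can happen when a thread pool
# inherits the request context from the event loop.
ctx = RequestContext()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda i: run_waf(ctx, "x" * i), range(100)))

print(ctx.calls, ctx.max_concurrent)  # 100 1
```

Keeping the lock on the context rather than making it global means requests with different contexts can still be evaluated in parallel; only calls that share a context wait for each other.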