github DataDog/dd-trace-py v4.6.0

Estimated end-of-life date, accurate to within three months: 05-2027
See the support level definitions for more information.

Upgrade Notes

  • LLM Observability
    • Experiments spans now contain config from the experiment initialization, allowing for searching of relevant spans using the experiment config.
    • Experiments spans now contain the tags from the dataset records, allowing for searching of relevant spans using the dataset record tags.

Deprecation Notes

  • tracing
    • The type annotation for Span.parent_id will change from Optional[int] to int in v5.0.0.

New Features

  • azure-api-management
    • This introduces inferred proxy support for Azure API Management.
  • Stats computation
    • Enables stats computation by default for Python 3.14 and above.
  • AI Guard
    • Adds SDS (Sensitive Data Scanner) findings to AI Guard spans, enabling visibility into sensitive data detected in LLM inputs and outputs.
  • LLM Observability
    • Experiments now report their execution status to the backend. Status transitions to running when execution starts, completed on success, failed when tasks or evaluators error with raise_errors=False, and interrupted when the experiment is stopped by an exception. #16713
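      The status lifecycle described above can be pictured as a small state machine. The sketch below illustrates only the documented transitions; it is not the ddtrace implementation, and the function name is hypothetical.

      ```python
      # Illustrative sketch of the experiment status transitions described
      # above -- not the ddtrace implementation. Execution starts in
      # "running"; it ends "completed" on success, "failed" when the task
      # errors with raise_errors=False, and "interrupted" when an
      # exception stops the run.

      def run_with_status(task, raise_errors=False):
          """Run `task` once and return the final experiment status."""
          try:
              task()  # status is "running" while the task executes
          except Exception:
              if raise_errors:
                  # The exception propagates and stops the experiment;
                  # the backend would record the status as "interrupted".
                  raise
              return "failed"
          return "completed"
      ```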

    • Adds LLMObs.publish_evaluator() to sync a locally-defined LLMJudge evaluator to the Datadog UI as a custom LLM-as-Judge evaluation.

    • Adds support for DeepEval evaluations in LLM Observability Experiments by allowing users to pass a DeepEval metric (which inherits from either BaseMetric or BaseConversationalMetric) as an experiment evaluator.

      Example:

      from deepeval.metrics import GEval
      from deepeval.test_case import LLMTestCaseParams
      
      from ddtrace.llmobs import LLMObs
      
      correctness_metric = GEval(
          name="Correctness",
          criteria="Determine whether the actual output is factually correct based on the expected output.",
          evaluation_steps=[
              "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
              "You should also heavily penalize omission of detail",
              "Vague language, or contradicting OPINIONS, are OK"
          ],
          evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
          async_mode=True
      )
      
      dataset = LLMObs.create_dataset(
          dataset_name="<DATASET_NAME>",
          description="<DATASET_DESCRIPTION>",
          records=[RECORD_1, RECORD_2, RECORD_3, ...]
      )
      
      def my_task(input_data, config):
          return input_data["output"]
      
      def my_summary_evaluator(inputs, outputs, expected_outputs, evaluators_results):
          return evaluators_results["Correctness"].count(True)
      
      experiment = LLMObs.experiment(
          name="<EXPERIMENT_NAME>",
          task=my_task,
          dataset=dataset,
          evaluators=[correctness_metric],
          summary_evaluators=[my_summary_evaluator], # optional, used to summarize the experiment results
          description="<EXPERIMENT_DESCRIPTION>."
      )
      
      result = experiment.run()
      
    • Adds experiment summary logging after run(), including row count, run count, per-evaluator stats, and error counts.

    • Adds max_retries and retry_delay parameters to experiment.run() for retrying failed tasks and evaluators. Example: experiment.run(max_retries=3, retry_delay=lambda attempt: 2 ** attempt).
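      The retry behavior can be approximated with a plain-Python helper. This is a hedged sketch of the documented semantics only, not the experiment.run() internals; run_with_retries is a hypothetical name, and max_retries is read here as the number of extra attempts after the first.

      ```python
      import time

      # Hypothetical sketch of the retry semantics described above; not
      # the ddtrace implementation. `retry_delay(attempt)` returns the
      # sleep in seconds before retry number `attempt` (1-based), e.g.
      # exponential backoff 2, 4, 8, ...

      def run_with_retries(task, max_retries=0, retry_delay=lambda attempt: 2 ** attempt):
          for attempt in range(max_retries + 1):
              try:
                  return task()
              except Exception:
                  if attempt == max_retries:
                      raise  # out of retries: surface the last failure
                  time.sleep(retry_delay(attempt + 1))
      ```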

    • This introduces LLMObs.get_prompt() to retrieve managed prompts from Datadog's Prompt Registry. The method returns a ManagedPrompt object with a format()
      method for variable substitution. Prompt updates propagate to running applications within the cache TTL (default: 60 seconds).
      Use with annotation_context or annotate to correlate prompts with LLM spans:

      prompt = LLMObs.get_prompt("greeting")
      variables = {"user": "Alice"}
      with LLMObs.annotation_context(prompt=prompt.to_annotation_dict(**variables)):
          openai.chat.completions.create(messages=prompt.format(**variables))
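      The "updates propagate within the cache TTL" behavior can be pictured with a toy TTL cache: a cached prompt is reused until the TTL elapses, then the next lookup refetches. This is a minimal sketch under those assumed semantics; the class and its injectable clock are hypothetical, not ddtrace internals.

      ```python
      import time

      # Toy TTL cache illustrating the propagation window described
      # above. Hypothetical sketch, not the ddtrace prompt cache.
      # `clock` is injectable so the expiry behavior is testable.

      class TTLCache:
          def __init__(self, ttl=60.0, clock=time.monotonic):
              self.ttl = ttl
              self.clock = clock
              self._entries = {}  # key -> (value, fetched_at)

          def get(self, key, fetch):
              """Return the cached value for `key`, refetching after TTL."""
              entry = self._entries.get(key)
              now = self.clock()
              if entry is None or now - entry[1] >= self.ttl:
                  entry = (fetch(key), now)
                  self._entries[key] = entry
              return entry[0]
      ```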
    • Experiments now propagate canonical_ids from dataset records to the corresponding experiment spans when present. The canonical_ids are only guaranteed to be available after calling pull_dataset.

    • LLMObs.create_dataset now supports a bulk_upload parameter to control data-upload behavior. Both LLMObs.create_dataset and LLMObs.create_dataset_from_csv support a deduplicate parameter.

    • A subset of dataset records can now be pulled by tag using the tags argument to LLMObs.pull_dataset, provided as a list of key:value strings: LLMObs.pull_dataset(dataset_name="my-dataset", tags=["env:prod", "version:1.0"])
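      The tag filter above behaves like an AND over key:value strings: a record matches only if it carries every requested tag. Below is a minimal sketch of that matching rule; the record shape is hypothetical, not the ddtrace dataset format.

      ```python
      # Sketch of the tag-matching rule implied above: keep only records
      # whose tags contain every requested "key:value" string. The record
      # shape here is hypothetical, not the ddtrace dataset format.

      def filter_records_by_tags(records, tags):
          wanted = set(tags)
          return [r for r in records if wanted <= set(r.get("tags", []))]
      ```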

Bug Fixes

  • LLM Observability
    • Fix data duplication issue when uploading > 5MB datasets via LLMObs.create_dataset.
  • ai_guard
    • Fixes a TypeError raised while processing failed AI Guard responses, which masked the original error.
  • openai_agents
    • Fixes an AttributeError on openai-agents >= 0.8.0 caused by the removal of AgentRunner._run_single_turn.
  • profiling
    • Fixes a bug that could prevent Profiling from being enabled when the library is installed through Single Step Instrumentation.
    • This fixes an issue where the profiler was patching the gevent module unnecessarily even when the profiler was not enabled.
    • A bug that would cause certain function names to be displayed as <module> in flame graphs has been fixed.
    • Fix lock contention in the profiler's greenlet stack sampler that could cause connection pool exhaustion in gevent-based applications (e.g. gunicorn + gevent + psycopg2). #16657
    • This fix resolves an issue where the lock profiler's wrapper class did not support PEP 604 type union syntax (e.g., asyncio.Condition | None). This was causing a TypeError at import time for libraries such as kopf that use union type annotations at class definition time.
  • data_streams
    • Add kafka_cluster_id tag to Kafka offset/backlog tracking for confluent-kafka. Previously, cluster ID was only included in DSM checkpoint edge tags (produce/consume) but missing from offset commit and produce offset backlogs. This ensures correct attribution of backlog data to specific Kafka clusters when multiple clusters share topic names.
  • AAP
    • Fixes a memory corruption issue where concurrent calls to the WAF on the same request context from multiple threads (e.g. an asyncio event loop and a thread pool executor inheriting the same context via contextvars) could cause use-after-free or double-free crashes (SIGSEGV) inside libddwaf. A per-context lock now serializes WAF calls on the same context.
  • tracing
    • Avoid pickling wrappers in ddtrace.internal.wrapping.context.BaseWrappingContext.
  • CI Visibility
    • Fixed an incompatibility with pytest-html and other third-party reporting plugins caused by the ddtrace pytest plugin using a non-standard dd_retry test outcome for retry attempts. The outcome is now set to rerun, which is the standard value used by pytest-rerunfailures and recognized by reporting plugins.
  • dynamic instrumentation
    • Fixes a RuntimeError: generator didn't yield in the Symbol DB remote config subscriber when the process has no writable temporary directory.
  • celery
    • Propagate distributed tracing headers for tasks that are not registered locally so traces link correctly across workers. #16662
  • Fixes a potential race condition affecting internal periodic worker threads that could have caused a RuntimeError during forks.
  • Add a timeout to Unix socket connections to prevent thread I/O hangs during pre-fork shutdown.

Other Changes

  • profiling
    • Reduces code provenance CPU overhead when using fork-based frameworks like gunicorn and uWSGI.
  • LLM Observability
    • Exports LLMJudge, BooleanStructuredOutput, ScoreStructuredOutput, and CategoricalStructuredOutput to the public ddtrace.llmobs module level.
