We're excited to announce MLflow 3.9.0, which includes several notable updates:
Major New Features:
- 🔮 MLflow Assistant: Figuring out the next steps to debug your apps and agents can be challenging. We're excited to introduce the MLflow Assistant, an in-product chatbot that can help you identify, diagnose, and fix issues. The assistant is backed by Claude Code, and directly passes context from the MLflow UI to Claude. Click on the floating "Assistant" button in the bottom right of the MLflow UI to get started!
- 📈 Trace Overview Dashboard: You can now get insights into your agent's performance at a glance with the new "Overview" tab in GenAI experiments. Many pre-built statistics are available out of the box, including performance metrics (e.g. latency, request count), quality metrics (based on assessments), and tool call summaries. If there are any additional charts you'd like to see, please feel free to raise an issue in the MLflow repository!
- ✨ AI Gateway: We're revamping our AI Gateway feature! AI Gateway provides a unified interface for your API requests, allowing you to route queries to your LLM provider(s) of choice. In MLflow 3.9.0, the Gateway server is now located directly in the tracking server, so you don't need to spin up a new process. Additional features such as passthrough endpoints, traffic splits, and fallback models are also available, with more to come soon! For more detailed information, please take a look at the docs.
- 🔎 Online Monitoring with LLM Judges: Configure LLM judges to automatically run on your traces, without having to write a line of code! You can either use one of our pre-defined judges, or provide your own prompt and instructions to create custom metrics. Head to the new "Judges" tab within the GenAI Experiment UI to get started.
- 🤖 Judge Builder UI: Define and iterate on custom LLM judge prompts directly from the UI! Within the new "Judges" tab, you can create your own prompt for an LLM judge, and test-run it on your traces to see what the output would be. Once you're happy with it, you can either use it for online monitoring (as mentioned above), or use it via the Python SDK for your evals.
- 🔗 Distributed Tracing: Trace context can now be propagated across different services and processes, allowing you to truly track request lifecycles from end to end. The related APIs are defined in the `mlflow.tracing.distributed` module (with more documentation to come soon).
- 📚 MemAlign - a new judge optimizer algorithm: We're excited to introduce `MemAlignOptimizer`, a new algorithm that makes your judges smarter over time. It learns general guidelines from past feedback while dynamically retrieving relevant examples at runtime, giving you more accurate evaluations.
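To make the Distributed Tracing item above more concrete, here is a minimal sketch of what "propagating trace context" between services generally means. It does not use the `mlflow.tracing.distributed` APIs (their signatures are not covered in these notes). It assumes the W3C `traceparent` header layout for illustration, and `new_trace_context`, `inject`, and `extract` are hypothetical helper names.

```python
import re
import secrets

# Assumed header layout (W3C Trace Context): version-traceid-parentid-flags
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_context():
    """Start a new trace with a random 16-byte trace ID and 8-byte span ID."""
    return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}


def inject(context, headers):
    """Caller side: copy the current trace context into outgoing HTTP headers."""
    headers["traceparent"] = f"00-{context['trace_id']}-{context['span_id']}-01"
    return headers


def extract(headers):
    """Callee side: continue the caller's trace, or start a new one if absent."""
    match = _TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return new_trace_context()
    trace_id, parent_span_id, _flags = match.groups()
    return {
        "trace_id": trace_id,              # shared across services
        "parent_span_id": parent_span_id,  # links back to the caller's span
        "span_id": secrets.token_hex(8),   # new span for this service
    }
```

The caller would run `inject(ctx, request_headers)` before sending a request, and the receiving service would run `extract(request.headers)` so that both sides record spans under the same trace ID.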
Features:
- [Gateway] Add LiteLLM provider to support many other providers (#19394, @TomeHirata)
- [Gateway] Add passthrough support for Anthropic Messages API (#19423, @TomeHirata)
- [Gateway] Add passthrough support for Gemini `generateContent` and `streamGenerateContent` APIs (#19425, @TomeHirata)
- [Gateway] Add routing strategy and fallback configuration support for gateway endpoints (#19483, @TomeHirata)
- [Gateway] Deprecate Unity Catalog function integration in AI Gateway (#19457, @harupy)
- [Gateway / UI] Create List API Keys landing page (#19441, @BenWilson2)
- [Gateway / UI] Add Create API Keys functionality (#19442, @BenWilson2)
- [Gateway / UI] Add delete and update capabilities for API Keys (#19446, @BenWilson2)
- [Gateway / UI] Add endpoint listing page and tab layout (#19474, @BenWilson2)
- [Gateway / UI] Add Create endpoint page and enhance provider select (#19475, @BenWilson2)
- [Gateway / UI] Add Model select functionality for endpoint creation (#19477, @BenWilson2)
- [Gateway / UI] Add Auth config to endpoint creation (#19494, @BenWilson2)
- [Gateway / UI] Add the Endpoint Edit Page (#19502, @BenWilson2)
- [Gateway / UI] Refactor the provider display for better UX (#19503, @BenWilson2)
- [Gateway / UI] Create Endpoint details page (#19537, @BenWilson2)
- [Gateway / UI] Add security notice banner (#19538, @BenWilson2)
- [Gateway / UI] Create common editable combo box with extra modal select (#19546, @BenWilson2)
- [Evaluation] Introduce `MemAlign` as a new optimizer for judge alignment (#19598, @smoorjani)
- [Evaluation] Parallelize LLM calls in `MemAlign` guideline distillation (#20291, @veronicalyu320)
- [Evaluation] Add `GePaAlignmentOptimizer` for judge instruction optimization (#19882, @alkispoly-db)
- [Evaluation] Add `Fluency` scorer for evaluating text quality (#19414, @alkispoly-db)
- [Evaluation] Add `KnowledgeRetention` built-in scorer (#19436, @alkispoly-db)
- [Evaluation] Implement automatic discovery for builtin scorers (#19443, @alkispoly-db)
- [Evaluation] Add Phoenix (Arize) third-party scorer integration (#19473, @debu-sinha)
- [Evaluation] Add gateway provider support for scorers (#19470, @danielseong1)
- [Evaluation] Introduce a conversation simulator into `mlflow.genai` (#19614, @smoorjani)
- [Evaluation] Integrate conversation simulation into `mlflow.genai.evaluate` (#19760, @smoorjani)
- [Evaluation] Make conversation simulator work with datasets (#19845, @SomtochiUmeh)
- [Evaluation] Support for conversational datasets with persona, goal, and context (#19686, @SomtochiUmeh)
- [Evaluation] Introduce conversational guidelines scorer (#19729, @smoorjani)
- [Evaluation] Update tool call correctness judge to accept expected tool calls (#19613, @smoorjani)
- [Evaluation] Support trace parsing fallback using Databricks model (#19654, @AveshCSingh)
- [Evaluation] Documentation for online evaluation / scoring (#20103, @dbczumar)
- [Evaluation] Job backend: Update job backend to use static names rather than function full names (#19430, @WeichenXu123)
- [Evaluation] Job backend: support job cancellation (#19565, @WeichenXu123)
- [Tracing] Support distributed tracing (#19920, @WeichenXu123)
- [Tracing] Trace Metrics backend (#19271, @serena-ruan)
- [Tracing] Add `IS NULL`/`IS NOT NULL` comparator support for trace metadata filtering (#19720, @dbczumar)
- [Tracing] Auto-navigate to Events tab when clicking error spans (#20188, @anshuman-sahu)
- [Tracing] Support shift+select for Traces (#20125, @B-Step62)
- [Tracing] SpringAI Integration (#19949, @joelrobin18)
- [Tracing] Reasoning in Chat UI for OpenAI, Anthropic, Gemini, Langchain, and PydanticAI (#19535, #19541, #19627, #19651, #19657, @joelrobin18)
- [UI] Merge MLflow Assistant branch (#20011, @B-Step62)
- [UI] Current Page context to assistant (#20139, @joelrobin18)
- [UI] Assistant regenerate button (#20066, @joelrobin18)
- [UI] Copy button Assistant (#20063, @joelrobin18)
- [UI] Overview tab for GenAI experiments (#19521, @serena-ruan)
- [UI] Enable Scorers UI feature flags (#19842, @danielseong1)
- [UI] Improve LLM judge creation modal UX and variable ordering (#19963, @danielseong1)
- [UI] Hide instructions section for built-in LLM judges (#19883, @danielseong1)
- [UI] Change model provider and name to dropdown list (#19653, @chenmoneygithub)
- [Prompts] Support Jinja2 template in prompt registry (#19772, @B-Step62)
- [Prompts] Support metaprompting in `mlflow.genai.optimize_prompts()` (#19762, @chenmoneygithub)
- [Prompts] Add option to delegate saving dspy model to `dspy.module.save` API (#19704, @WeichenXu123)
- [Prompts / UI] Add traces mode to prompts details page and implement filtered traces (#19599, @TomeHirata)
- [Tracking] Support `mlflow.genai.to_predict_fn` for app invocation endpoints (#19779, @jennsun)
- [Tracking] Add `log_stream` API for logging binary streams as artifacts (#19104, @harupy)
- [Tracking] Add `import_checkpoints` API for databricks SGC Checkpointing with MLflow (#19839, @WeichenXu123)
- [Tracking] Support GC clean up for Historical Jobs (#19626, @joelrobin18)
- [Tracking] Add `JupyterNotebookRunContext` for Tracking local Jupyter notebook as the source (#19162, @iyashk)
- [Tracking] Full docker image support with db (#19979, @serena-ruan)
- [Tracking] Add react route handling to communicate with the tracking server (#19010, @BenWilson2)
- [Tracking] [TypeScript SDK] Simplify Databricks auth by delegating to Databricks SDK (#19434, @simonfaltum)
- [Models] Safe model serialization: Support saving pytorch model via `torch.export.save`, add `skops` serialization format, and deprecate unsafe pickle/cloudpickle formats (#18759, #18832, #19692, #20151, @WeichenXu123)
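
The traffic split and fallback behavior added to Gateway endpoints in this release can be pictured with a small, library-free sketch. This is not the Gateway implementation or its configuration format. `pick_model`, `route`, and the `providers` mapping of model name to callable are hypothetical names used only to illustrate weighted routing with ordered fallbacks.

```python
import random


def pick_model(traffic_split, rng=random):
    """Choose one model name at random, in proportion to its traffic weight."""
    names = list(traffic_split)
    weights = [traffic_split[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def route(prompt, providers, traffic_split, fallbacks=(), rng=random):
    """Call the model chosen by the split, then each fallback in order on failure."""
    primary = pick_model(traffic_split, rng)
    candidates = [primary] + [name for name in fallbacks if name != primary]
    last_error = None
    for name in candidates:
        try:
            return name, providers[name](prompt)
        except Exception as err:  # a real gateway would catch narrower errors
            last_error = err
    raise RuntimeError("all candidate models failed") from last_error
```

For example, `route("hello", providers, {"model-a": 0.9, "model-b": 0.1}, fallbacks=["model-c"])` would send roughly 90% of requests to the hypothetical `model-a` first and fall back to `model-c` only if the chosen model raises.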
Bug fixes:
- [Gateway] Fix Anthropic and Gemini streaming for LiteLLM providers (#20398, @TomeHirata)
- [Build] Include git submodule contents in Python package build (#20394, @copilot-swe-agent)
- [Tracing] Fix duplicate traces in semantic kernel autolog (#20206, @harupy)
- [Tracing] Fix Claude autolog to prioritize settings.json over OS environment variables (#20376, @alkispoly-db)
- [Evaluation] Fix temperature/json issues with `ConversationSimulator` on managed (#20236, @xsh310)
- [Tracing / UI] Add support for OpenAI function calling inputs in chat UI parsing (#20058, @daniellok-db)
- [Tracking] Update checking code for pickle deserialization (#20267, @WeichenXu123)
- [Gateway] Fix Vertex AI model configuration (#20242, @TomeHirata)
- [UI] Store gateway<>scorer binding correctly (#20176, @TomeHirata)
- [Evaluation] Support `SparkDF` trace handling in eval (#20207, @BenWilson2)
- [Evaluation] Fix tool name extraction for tool call correctness (#20201, @smoorjani)
- [Prompts] Fix scorers issue in metaprompting (#20173, @chenmoneygithub)
- [UI] Propagate Run id context to Assistant (#20138, @joelrobin18)
- [Model Registry] Allow for model registration to use KMS auth from different workspace (#20156, @BenWilson2)
- [UI] Improve scorer trace picker UX and validation (#20178, @danielseong1)
- [Evaluation] Improve `MemAlign` optimizer for incremental judge alignment (#20049, @veronicalyu320)
- [Evaluation] Fix bug with max tokens using max output tokens (#20174, @smoorjani)
- [Evaluation] Fix a race condition bug when using DF inputs for genai eval (#20079, @BenWilson2)
- [Tracking] Fix `DATABRICKS_CONFIG_PROFILE` env var detection when fetching databricks credentials (#20112, @daniellok-db)
- [Gateway] Move gateway invocation validation to fastapi middleware (#20111, @TomeHirata)
- [Prompts] Fix the length check in `mlflow.genai.optimize_prompts()` (#19993, @chenmoneygithub)
- [UI] Fix trace selection not registering in SelectTracesModal (#20099, @joelrobin18)
- [UI] Fix LimitOverrunError in Assistant streaming (#20078, @joelrobin18)
- [Tracing] CC Token usage (#20022, @joelrobin18)
- [Gateway] Remove MLflow-specific `auth_mode` from `LiteLLMConfig` (#20059, @TomeHirata)
- [UI] Assistant UI fix for dark theme (#20056, @joelrobin18)
- [Tracing] Isolate runtime context between opentelemetry and mlflow (#19797, @B-Step62)
- [UI] Prevent spurious 404 requests for relative image URLs in markdown (#20003, @harupy)
- [Tracing] Support MLflow tracing with OpenTelemetry auto-instrumentation (#19501, @serena-ruan)
- [UI] Fix session selector table column resizing and link behavior (#19927, @danielseong1)
- [Gateway] Add Azure provider support in gateway configuration (#19933, @TomeHirata)
- [Gateway] Propagate extra auth config to LiteLLM provider (#19931, @TomeHirata)
- [Evaluation / UI] Add missing retrieval context error for retrieval scorers (#19895, @danielseong1)
- [Evaluation / UI] Improve trace selection UX in scorer/judge UI (#19913, @danielseong1)
- [Model Registry / Models] Fix `infer_code_paths` to capture transitive imports of functions/classes (#19814, @copilot-swe-agent)
- [Tracking] fix for addressing rest api call latency in databricks job run (#19886, @WeichenXu123)
- [UI] Enable {{trace}} variable support in sample judge evaluation (#19851, @danielseong1)
- [Scoring] Check security before extracting tar file (#19557, @WeichenXu123)
- [Gateway] Fix authorization header duplication (#19853, @TomeHirata)
- [Gateway] Fix Gateway error handling to translate `MlflowException` to `HTTPException` (#19728, @danielseong1)
- [Gateway] Remove `gateway_deprecated` decorator - AI Gateway is not deprecated (#19821, @copilot-swe-agent)
- [Tracking] Make local artifact location creation lazy to support read-only proxy environments (#19678, @BenWilson2)
- [Evaluation] fixed databricks hosted llm failure due to `response_schema` injection (#19741, @sinanshamsudheen)
- [Evaluation] Add `@overload` annotations to `@scorer` decorator for proper type inference (#19570, @mr-brobot)
- [Tracking] Add debug logging for 500 errors in `catch_mlflow_exception` (#19781, @harupy)
- [Tracing] [Bug fix] Support search traces by string feedback / expectation values (#19719, @dbczumar)
- [Tracing / UI] Fix scorer creation UX issues (#19756, @danielseong1)
- [Evaluation] Fix `KnowledgeRetention` model parameter not propagating to inner scorer (#19753, @danielseong1)
- [Tracking] [BUG] `serve-artifacts` is not enabled in docker-compose #19700 (#19701, @zjffdu)
- [Tracing] Fix type signature loss in `@trace_disabled` decorator (#19569, @mr-brobot)
- [Tracking] Fix: Return 400 instead of 500 for invalid experiment_id (#19655, @copilot-swe-agent)
- [Models] Fix schema enforcement for pandas `StringDtype` (#19518, @harupy)
- [Tracing] Fix Python 3.12 `DeprecationWarning` from `generator.throw()` in tracing (#19629, @mr-brobot)
- [Evaluation] Fix structured outputs for databricks serving endpoints (#19572, @smoorjani)
- [Models / Scoring] Add dict to `PyFuncOutput` type alias for `ResponsesAgent`/`ChatAgent`/`ChatModel` (#19560, @copilot-swe-agent)
- [Tracking] Fix `enable_git_model_versioning` to work from subdirectories (#19529, @copilot-swe-agent)
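
For readers wondering what "Check security before extracting tar file" (#19557) guards against, the sketch below shows one common way to validate an archive before extraction. It is not the code from that PR. `check_tar_members` and `safe_extract` are hypothetical helpers, and the policy shown (reject paths that escape the destination, reject links) is an assumption about what such a check typically covers.

```python
import os
import tarfile


def check_tar_members(tar, dest_dir):
    """Raise ValueError if any member would be written outside dest_dir."""
    root = os.path.realpath(dest_dir)
    for member in tar.getmembers():
        # Resolve where this member would land, including ".." and absolute paths.
        target = os.path.realpath(os.path.join(root, member.name))
        if os.path.commonpath([root, target]) != root:
            raise ValueError(f"unsafe path in archive: {member.name}")
        # Symlinks and hard links can redirect later writes, so reject them here.
        if member.issym() or member.islnk():
            raise ValueError(f"links are not allowed: {member.name}")


def safe_extract(archive_path, dest_dir):
    """Validate every member first, then extract the whole archive."""
    with tarfile.open(archive_path) as tar:
        check_tar_members(tar, dest_dir)
        tar.extractall(dest_dir)
```

On Python versions that support extraction filters, passing `filter="data"` to `extractall` provides a similar built-in safeguard.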
Documentation updates:
- [Docs] fix: Remove `multi_class` argument from scikit-learn's `LogisticRegression` in docs (#20266, @SOORAJTS2001)
- [Docs] Add doc for distributed tracing (#20027, @WeichenXu123)
- [Docs] Add Judge Builder UI documentation (#20163, @danielseong1)
- [Docs] Add framework integration examples for AI Gateway query-endpoint page (#20137, @TomeHirata)
- [Docs] Add "Evaluation Examples" article (#19722, @achen530)
- [Docs] [1/3] Add gateway tracing guide for LiteLLM, OpenRouter, and Vercel AI Gateway (#20031, @B-Step62)
- [Docs] Update prompt optimization doc to include metaprompting (#19966, @chenmoneygithub)
- [Docs] Reorganize gateway page structure (#19968, @TomeHirata)
- [Build / Docs] Fix broken auth REST API documentation links (#19872, @copilot-swe-agent)
- [Docs] Add setup and query documentation for new AI Gateway (#19804, @TomeHirata)
- [Docs] Add additional eval dataset serialization examples (#19697, @BenWilson2)
- [Docs] ML-60766: Add dataset schema from managed content to SDK reference page (#19676, @achen530)
- [Docs / Prompts] Fix duplicate tags argument in `register_prompt` documentation example (#19591, @copilot-swe-agent)
- [Docs] Fix ML-59546 eval quickstart links to wrong place, add notebook version of eval quickstart (#19511, @achen530)
- [Docs] Add documentation for `KnowledgeRetention` scorer (#19478, @alkispoly-db)
Small bug fixes and documentation updates:
#20406, #20122, #20317, #20333, #20361, #20274, #20362, #20249, #20169, #20345, #20252, #20314, #20214, #20215, #20210, #20212, #20142, #20183, #20121, #20141, #20140, #20124, #20073, #20062, #20065, #19893, #19912, #19464, #19857, #19401, #19600, #19555, #19400, #19392, #19393, @B-Step62; #20323, #20263, #19982, #20218, #20143, #20146, #20145, #20064, #20117, #20144, #20110, #20050, #20017, #20116, #20118, #19989, #19953, #19836, #19915, #19955, #19952, #19940, #19939, #19938, #19937, #19877, #19874, #19869, #19867, #19865, #19837, #19835, #19834, #19864, #19873, #19833, #19825, #19876, #19799, #19798, #19793, #19771, #19770, #19635, #19634, #19633, #19632, #19624, #19622, #19621, #19620, #19631, #19619, #19747, #19609, #19608, #19607, #19606, #19604, #19603, #19602, #19601, #19588, #19587, #19581, #19585, #19610, #19590, #19580, #19579, #19578, #19577, #19576, #19234, @serena-ruan; #20378, #20385, #20205, #20237, #20193, #20171, #20155, #20170, #20132, #20097, #20100, #20101, #19736, #19717, #19716, #19759, #19718, #19714, #19713, #19712, #19711, #19840, #19710, #19709, #19708, #19777, #19707, @dbczumar; #20387, #19981, #19964, @bbqiu; #20390, #20334, #20208, #19978, #19980, #19875, #19854, #19816, #19815, #19796, #19806, #19785, #19789, #19769, #19748, #19773, #19782, #19706, #19523, #19505, #19450, #19482, #19458, #19433, #19431, #19455, #19417, #19426, #19424, @harupy; #20355, #20245, #20120, #20229, #20114, #20053, #20012, #19972, #20002, #19991, #19990, #19977, #19986, #19985, #19967, #19957, #19960, #19954, #19945, #19941, #19934, #19917, #19916, #19905, #19904, #19903, #19900, #19899, #19897, #19894, #19892, #19890, #19888, #19887, #19861, #19828, #19818, #19803, #19802, #19791, #19788, #19795, #19790, #19786, #19783, #19767, #19768, #19746, #19735, #19733, #19732, #19726, #19561, #19549, #19544, #19543, #19510, #19486, #19487, #19463, #19871, @copilot-swe-agent; #20308, #20264, #20109, #20181, #20180, #20177, #20134, #20107, #20015, #20007, #20008, 
#19930, #20006, #20005, #19965, #19942, #19944, #19950, #19936, #19947, #19948, #19946, #19870, #19824, #19823, #19856, #19863, #19858, #19860, #19849, #19822, #19765, #19792, #19764, #19763, #19618, #19453, #19452, #19404, #19390, #19290, @TomeHirata; #20350, #20203, #19675, #19677, #19674, #19476, #19447, @BenWilson2; #20286, #20157, #20051, #20216, #20200, #20213, #20194, #20072, #20195, #20175, #20039, #19844, #19935, #19696, #19451, #19409, @smoorjani; #20209, #20131, #19742, #19969, #19734, #19480, #19351, @daniellok-db; #20204, #20164, #20192, #19997, #19925, #19850, #19914, #19774, #19721, #19673, #19623, #19668, #19496, #19554, #19471, @danielseong1; #20037, #19884, #19846, #19843, #19813, #19454, #19391, #19322, #19388, #19307, #19382, @xsh310; #20130, @iyashk; #20147, #20030, #19962, #19826, @kevin-lyn; #20108, #20071, #19743, #20045, #20042, #19959, #19880, @SomtochiUmeh; #20025, #19662, #19749, #19738, #19419, @WeichenXu123; #19847, @jaceklaskowski; #19820, @Abhiii47; #19800, @shreenidhi2205; #19703, #19693, #19689, #19688, #19664, #19663, #19660, #19534, #19533, #19532, #19531, @hubertzub-db; #19652, @AMRUTH-ASHOK; #19493, #19495, @alkispoly-db; #16372, @mohammadsubhani; #19522, @pmeier