This release expands Phoenix's capabilities for cluster-based analysis, providing more metrics to help you assess the performance and data quality of your unstructured data.
✨ Cluster Performance Metrics
Clusters can now be analyzed for model performance degradation! This release adds accuracy_score as a model performance metric. Using accuracy as the base metric on the embedding projection lets you drill into clusters that map to bad predictions more quickly than ever before. Finding pockets of bad performance is as simple as picking the metric and sorting the clusters from worst to best. If you are using Phoenix to identify production data that should be re-labeled and fed back into your training pipeline, this is the feature for you.
cluster_performance.mp4
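If you want to try this with your own data, the key requirement is that your dataset declares both prediction and actual labels. Below is a minimal sketch of that setup; the dataframe and column names ("prediction", "label", "embedding") are illustrative, not part of this release:

```python
# A minimal sketch, not taken from the release itself. The accuracy metric
# becomes available once the schema declares both a prediction label and an
# actual label alongside an embedding.
import pandas as pd
import phoenix as px

df = pd.DataFrame(
    {
        "prediction": ["fraud", "not_fraud", "fraud"],
        "label": ["not_fraud", "not_fraud", "fraud"],
        "embedding": [[0.1, 0.4], [0.8, 0.2], [0.3, 0.9]],
    }
)

schema = px.Schema(
    prediction_label_column_name="prediction",
    actual_label_column_name="label",
    embedding_feature_column_names={
        "embedding": px.EmbeddingColumnNames(vector_column_name="embedding"),
    },
)

# In the app, select accuracy_score on the embedding projection and sort
# the clusters from worst to best to surface pockets of bad performance.
px.launch_app(px.Dataset(df, schema))
```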
✨ Cluster Data Quality / Custom Metrics
Clusters can now be analyzed via ad-hoc metrics! You can now calculate the average of any numeric feature, tag, prediction, or actual sent into Phoenix, which means you can find "low-quality" clusters via the heuristic of your choosing. Below is an example in which precision@k for document retrieval (from a vector store) is used to identify clusters of chatbot queries that are failing to produce a good answer. The neat thing about this feature is that you can use Phoenix to build your own EDA heuristic: care about ROUGE scores or LLM-assisted evaluations? You can now use these to analyze your embeddings and discover anomalies simply by sorting your clusters. This feature gives you, the data scientist, a powerful tool for formulating bespoke heuristics that identify clusters of low performance, low quality, or drift. We hope you like it!
context_retrieval.mp4
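Any numeric column you log can serve as the heuristic. As an illustration, the sketch below logs a hypothetical precision_at_k column (computed upstream by your own retrieval evaluation) as a tag, so that its per-cluster average can be selected as a data quality metric:

```python
# A minimal sketch, assuming a hypothetical "precision_at_k" column that you
# compute yourself from your retrieval evaluation. Any numeric feature or
# tag works the same way (ROUGE scores, LLM-assisted evaluations, etc.).
import pandas as pd
import phoenix as px

df = pd.DataFrame(
    {
        "query": ["how do I reset my password?", "what is the refund policy?"],
        "query_embedding": [[0.12, 0.85, 0.33], [0.91, 0.05, 0.44]],
        "precision_at_k": [0.2, 0.8],
    }
)

schema = px.Schema(
    tag_column_names=["precision_at_k"],
    embedding_feature_column_names={
        "query_embedding": px.EmbeddingColumnNames(
            vector_column_name="query_embedding",
            raw_data_column_name="query",
        ),
    },
)

# Select the average of precision_at_k as the cluster metric and sort to
# find clusters of queries with poor retrieval.
px.launch_app(px.Dataset(df, schema))
```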
What's Changed
- docs: dolly vs. pythia by @axiomofjoy in #818
- feat: data quality metric by cluster by @RogerHYang in #804
- feat(dimensions): Add the ability to filter by data_type by @mikeldking in #822
- feat(embeddings): metric selector by @mikeldking in #821
- fix: nan bug for gql by @RogerHYang in #832
- feat: add stand-alone clusters endpoint for GraphQL query by @RogerHYang in #831
- feat(embeddings): cluster sorting by @mikeldking in #830
- chore: make placeholder text more obvious by @mikeldking in #833
- fix: change float16 to float32 as dtype for the nan series by @RogerHYang in #837
- fix: return nan on NotImplementedError (when binning on np.float16) by @RogerHYang in #838
- docs: sync 06-09-2023 by @mikeldking in #840
- feat(gql): add prediction id to event metadata by @RogerHYang in #843
- fix: coerce lists to arrays by @RogerHYang in #845
- feat: add performance metrics to each cluster by @RogerHYang in #828
- feat: accuracy timeseries by @RogerHYang in #842
- feat(embeddings): cluster data quality metrics by @mikeldking in #846
- docs: Update DEVELOPMENT.md with pypi publish changes. by @mikeldking in #849
- fix(embeddings): always place clusters with empty metrics at the bottom by @mikeldking in #850
- fix: show not found error when server is no longer running by @mikeldking in #853
- fix: guess whether a column contains any vector or all scalars by @RogerHYang in #854
- chore: camel-case metrics by @mikeldking in #856
- fix: skip empty interval bin with infinity endpoints (when all data are missing values) by @RogerHYang in #857
- feat(embeddings): cluster performance metrics by @mikeldking in #855
- fix(embeddings): force re-render clusters when opacity changes by @mikeldking in #858
- feat: show prediction id in selection details by @RogerHYang in #860
- fix: hide data quality metrics if empty by @mikeldking in #861
- fix: use random init when spectral init (the default) cannot be used by @RogerHYang in #862
- fix: replace NaT (Not a Time) with now (when dataset is empty) by @RogerHYang in #863
- fix(ui): cleanup event details for llm use-case by @mikeldking in #865
Full Changelog: v0.0.23...v0.0.24