github ianarawjo/ChainForge v0.1.6
v0.1.6: OpenAI evals

Added 188 OpenAI Evals to Example Flows

We've added 188 example flows generated directly from OpenAI evals benchmarks.
In Example Flows, navigate to the "OpenAI Evals" tab, and click the benchmark you wish to load:

[Screen recording: Screen.Recording.2023-06-15.at.3.49.32.PM.mov]

Each Evaluator contains the code appropriate to its evaluation type, as referenced from the OpenAI eval-templates doc.

Example: Tetris problems

For example, I was able to compare GPT-4's performance on Tetris problems with GPT-3.5's, simply by loading the eval, adding GPT-4, and pressing run:

[Screenshot: 2023-06-15, 4:10:36 PM]

I was curious whether the custom system message had any effect on GPT-3.5's performance, so I added a version without it, and in 5 seconds found out that the system message had no effect:

[Screenshot: 2023-06-15, 4:13:38 PM]

Supported OpenAI evals

A large subset of OpenAI evals is supported. We currently display OpenAI evals with:

  • a common system message
  • a single 'turn' (prompt)
  • an evaluation type of 'includes', 'match', or 'fuzzy match'
  • a reasonable number of prompts (e.g., spanish-lexicon, which is not included, has 53,000 prompts)

We hope to add evals that rely on model evaluations (e.g., chain-of-thought prompting) in the near future.
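
The three supported evaluation types listed above can be sketched roughly as follows. This is a simplified approximation for illustration, not the code shipped in the example flows or in OpenAI evals: the function names, the `normalize` helper, and the exact rules (prefix check for 'match', substring check for 'includes', normalized substring check in either direction for 'fuzzy match') are assumptions.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def match(response: str, ideals: list[str]) -> bool:
    # Assumed rule: the response begins with one of the ideal answers.
    return any(response.startswith(ideal) for ideal in ideals)


def includes(response: str, ideals: list[str]) -> bool:
    # Assumed rule: one of the ideal answers appears anywhere in the response.
    return any(ideal in response for ideal in ideals)


def fuzzy_match(response: str, ideals: list[str]) -> bool:
    # Assumed rule: after normalization, either string contains the other.
    norm_response = normalize(response)
    for ideal in ideals:
        norm_ideal = normalize(ideal)
        if not norm_response or not norm_ideal:
            # Empty strings only match each other.
            if norm_response == norm_ideal:
                return True
            continue
        if norm_response in norm_ideal or norm_ideal in norm_response:
            return True
    return False
```

Under these assumptions, 'match' is the strictest check and 'fuzzy match' the most forgiving, which is worth keeping in mind when comparing scores across evals of different types.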

The cforge flows were precompiled from the oaievals registry. To save space, the files are not included in the PyPI chainforge package; instead, they are fetched from GitHub on an as-needed basis. We precompiled the evals to avoid forcing users to install OpenAI evals, which requires Git LFS, Python 3.9+, and a large number of dependencies.
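
As a rough illustration of fetching a flow only when it is requested, here is a minimal sketch. It is not ChainForge's actual loader: `BASE_URL`, `http_get`, and `load_flow` are hypothetical names, the URL is a placeholder, keeping a local copy of the downloaded file is an added assumption, and the sketch assumes a cforge file is JSON.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable

# Hypothetical placeholder; the real file location is not given in these notes.
BASE_URL = "https://example.com/oaievals"


def http_get(url: str) -> bytes:
    """Download the raw bytes at `url`."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def load_flow(
    name: str,
    local_dir: Path,
    fetch: Callable[[str], bytes] = http_get,
) -> dict:
    """Return the flow called `name`, downloading it only if no local copy exists."""
    local_dir.mkdir(parents=True, exist_ok=True)
    path = local_dir / f"{name}.cforge"
    if not path.exists():
        # First request for this flow: fetch it and keep a local copy.
        path.write_bytes(fetch(f"{BASE_URL}/{name}.cforge"))
    # Assumption: the flow file is JSON.
    return json.loads(path.read_text(encoding="utf-8"))
```

Passing `fetch` as a parameter keeps the download step replaceable, so the function can be exercised without network access.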

Finally, note that responses are not cached for these flows, unlike the other examples; you will need to query OpenAI models yourself to run them.


Minor Notes

This release also:

  • Changed Textareas to contenteditable p tags inside Tabular Data Nodes. Though this compromises usability slightly, there is a huge gain in performance when loading large tables (e.g., 1000 rows or more), which is required for some OpenAI evals in the examples package.
  • Fixed a bug in VisNode where a plot was not displayed when a single LLM was present, the number of prompt variables was >= 1, and no variables were selected.

If you run into any problems using OpenAI evals examples, or with any other part of CF, please let us know.

We could not manually test all of the new example flows, due to how many API calls would be required. Happy ChainForging!
