EvalPlus v0.3.1

For the past 6+ months, we have been actively maintaining and improving the EvalPlus repository. Now we are thrilled to announce a new release!

🔥 EvalPerf for Code Efficiency Evaluation

Based on our COLM'24 paper, we have integrated the EvalPerf dataset into the EvalPlus repository.
EvalPerf is curated using the Differential Performance Evaluation methodology proposed in the paper, which argues that effective code efficiency evaluation requires:

  • Performance-exercising tasks -- our tasks are validated to be challenging in terms of code efficiency!
  • Performance-exercising inputs -- for each task, we generate a performance-challenging test input!
  • Compound metric: Differential Performance Score (DPS) -- inspired by LeetCode's efficiency ranking of submissions, it supports conclusions like "your submission can outperform 80% of LLM solutions..."

The EvalPerf dataset initially contains 118 coding tasks^ (a subset of the latest HumanEval+ and MBPP+) -- running EvalPerf is as simple as the following commands:

pip install "evalplus[perf,vllm]" --upgrade
# Or: pip install "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus" --upgrade

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm

At evaluation time, we perform the following steps by default:

  1. Correctness sampling: We sample 100 solutions (n_samples) from the LLM for each task and check their correctness
  2. Efficiency evaluation: For tasks with 10+ passing solutions, we evaluate the code efficiency of (at most 20) passing solutions:
    • Primitive metric: # CPU instructions
    • We profile the # CPU instructions of (i) the new LLM solutions and (ii) the representative performance reference solutions, all run on the performance-challenging test input
    • We match each profiled new solution to the reference solution with comparable code efficiency to calculate $DPS$ and $DPS_{norm}$
    • e.g., given 10 reference samples in 4 clusters of sizes [3, 2, 3, 2], matching the 3rd cluster leads to $DPS = \frac{sample\ rank}{total\ samples} = \frac{3+2+3}{10} = 80\%$ and $DPS_{norm} = \frac{cluster\ rank}{total\ clusters} = \frac{1+1+1}{4} = 75\%$ (see the sketch below)

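To make the example above concrete, here is a minimal Python sketch of the score computation -- an illustration only, not the EvalPlus implementation -- assuming the reference clusters are ordered from least to most efficient:

# Illustration only (not the EvalPlus implementation).
# `cluster_sizes` holds the number of reference solutions in each efficiency
# cluster, assumed ordered from least to most efficient; `matched` is the
# 1-based index of the cluster whose efficiency the new solution matches.
def dps_scores(cluster_sizes, matched):
    total_samples = sum(cluster_sizes)
    sample_rank = sum(cluster_sizes[:matched])       # samples matched or outperformed
    dps = 100.0 * sample_rank / total_samples        # sample-level DPS
    dps_norm = 100.0 * matched / len(cluster_sizes)  # cluster-level DPS_norm
    return dps, dps_norm

print(dps_scores([3, 2, 3, 2], matched=3))  # (80.0, 75.0)
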
This is collaborative work with @soryxie and @FatPigeorz!

🔥 Command-line Interface (CLI) Simplification

We have largely simplified the evaluation pipeline:

  • Previously: run evalplus.codegen, then evalplus.sanitize, and then evalplus.evaluate, each with different parameters:
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]            \
                  --backend vllm                        \
                  --greedy

evalplus.sanitize --samples [path/to/samples]

evalplus.evaluate --samples [path/to/samples]
  • Now:
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend vllm                         \
                  --greedy

Other notable updates

  • Sanitizer improvements (#189, #190) -- thanks to @Co1lin
  • Fixing an edge case of is_float (#196)
  • HumanEval+ maintenance: v0.1.10 by improving contracts & oracle (#186, #201) -- thanks to @Kristoff-starling @Co1lin
  • MBPP+ maintenance: v0.2.1 by improving contracts & oracle (#211, #212)
  • Default behavior change: code generation results are saved as .jsonl rather than massive individual files and folders
  • Using the official tree-sitter package and its Python binding
  • Prompt: adding a newline after stripping the prompt, as some models are more familiar with """\n than with """
  • Configurable maximum evaluation-process memory via the environment variable EVALPLUS_MAX_MEMORY_BYTES (see the sketch after this list)
  • When the sampling size per task is > 1, the batch size is automatically set to min(n_samples, 32) if --bs is not given
  • Sanitizer behavior: when the code is too broken to be sanitized, return the broken code rather than an empty string for debuggability.
  • vLLM: automatic prefix caching is enabled to accelerate sampling (hopefully)
  • Setting top_p = 0.95 for OpenAI, Google, and Anthropic backends
  • New argument: --trust-remote-code
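
As a small illustration of two of the knobs above, here is one way to drive the one-step CLI from a Python script; the 4 GiB memory cap, dataset, and model are assumed example values, not defaults:

import os
import subprocess

# Cap the evaluation process memory (4 GiB is an assumed example value) via the
# EVALPLUS_MAX_MEMORY_BYTES environment variable, then launch the one-step
# evaluation with the new --trust-remote-code flag.
env = dict(os.environ, EVALPLUS_MAX_MEMORY_BYTES=str(4 * 1024**3))
subprocess.run(
    [
        "evalplus.evaluate",
        "--model", "ise-uiuc/Magicoder-S-DS-6.7B",
        "--dataset", "humaneval",
        "--backend", "vllm",
        "--greedy",
        "--trust-remote-code",
    ],
    env=env,
    check=True,
)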

PyPI: https://pypi.org/project/evalplus/0.3.1/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.3.1/images/sha256-26b118098bef281fe8dfe999bf05f1d5b45374b4e6c00161ec0f30592aef4740

^In our COLM paper, we presented 121 tasks based on the February version of MBPP+ (v0.1.0), which at the time contained 399 tasks. In MBPP+ (v0.2.0), we removed some broken tasks (399 -> 378), leading to a slight cut in the number of EvalPerf tasks as well.

^We skipped/yanked the release of v0.3.0 and directly released v0.3.1 due to a broken dependency in v0.3.0.
