For the past 6+ months, we have been actively maintaining and improving the EvalPlus repository. Now we are thrilled to announce a new release!
🔥 EvalPerf for Code Efficiency Evaluation
Based on our COLM'24 paper, we integrated the EvalPerf dataset into the EvalPlus repository.
EvalPerf is a dataset curated using the Differential Performance Evaluation methodology proposed in the paper, which argues that effective code efficiency evaluation requires:
- Performance-exercising tasks -- our tasks are verified to be challenging in terms of code efficiency!
- Performance-exercising inputs -- for each task, we generate a performance-challenging test input!
- A compound metric: the Differential Performance Score (DPS) -- inspired by LeetCode's efficiency ranking of submissions, it supports conclusions like "your submission outperforms 80% of LLM solutions..."
The EvalPerf dataset initially includes 118 coding tasks^ (a subset of the latest HumanEval+ and MBPP+) -- running EvalPerf is as simple as the following commands:
```bash
pip install "evalplus[perf,vllm]" --upgrade
# Or: pip install "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus" --upgrade
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
```
At evaluation time, we by default perform the following steps:
- Correctness sampling: we sample the LLM for 100 solutions per task (`n_samples`) and perform correctness checking
- Efficiency evaluation: for tasks with 10+ passing solutions, we evaluate the code efficiency of (at most 20) passing solutions:
  - Primitive metric: # CPU instructions
  - We profile the # CPU instructions of (i) new LLM solutions and (ii) representative performance reference solutions, each run over the performance-challenging test input
  - We match each profiled new solution to the reference solution with comparable code efficiency to calculate $DPS$ and $DPS_{norm}$
    - e.g., given 10 reference samples in 4 clusters [3, 2, 3, 2], matching the 3rd cluster leads to $DPS = \frac{sample\ rank}{total\ samples} = \frac{3+2+3}{10}=80\%$ and $DPS_{norm} = \frac{cluster\ rank}{total\ clusters} = \frac{1+1+1}{4}=75\%$
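To make the scoring arithmetic concrete, here is a minimal Python sketch of the $DPS$ / $DPS_{norm}$ computation for the example above (the function name and cluster ordering are illustrative assumptions, not the exact EvalPlus implementation):

```python
# Illustrative sketch (not EvalPlus internals): reference solutions are grouped
# into efficiency clusters, ordered from least to most efficient, and a new
# solution is matched to the cluster with comparable efficiency.
def dps_scores(cluster_sizes: list[int], matched_rank: int) -> tuple[float, float]:
    """cluster_sizes: number of reference samples per cluster (least -> most efficient).
    matched_rank: 1-indexed rank of the cluster the new solution is matched to."""
    total_samples = sum(cluster_sizes)
    sample_rank = sum(cluster_sizes[:matched_rank])  # samples the new solution matches or outperforms
    dps = sample_rank / total_samples                # e.g., (3 + 2 + 3) / 10 = 0.80
    dps_norm = matched_rank / len(cluster_sizes)     # e.g., 3 / 4 = 0.75
    return dps, dps_norm

print(dps_scores([3, 2, 3, 2], matched_rank=3))      # (0.8, 0.75)
```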
This is collaborative work with @soryxie and @FatPigeorz!
🔥 Command-line Interface (CLI) Simplification
We have greatly simplified the evaluation pipeline:
- Previously: run `evalplus.codegen`, then `evalplus.sanitize`, and then `evalplus.evaluate`, each with different parameters:
```bash
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset [humaneval|mbpp] \
                 --backend vllm \
                 --greedy
evalplus.sanitize --samples [path/to/samples]
evalplus.evaluate --samples [path/to/samples]
```
- Now:
```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --greedy
```
Other notable updates
- Sanitizer improvements (#189, #190) -- thanks to @Co1lin
- Fixing an edge case of `is_float` (#196)
- HumanEval+ maintenance: `v0.1.10` by improving contracts & oracles (#186, #201) -- thanks to @Kristoff-starling @Co1lin
- MBPP+ maintenance: `v0.2.1` by improving contracts & oracles (#211, #212)
- Default behavior change: code generation results are saved as `.jsonl` rather than as massive individual files and folders
- Using the official tree-sitter package and its Python binding
- Prompt: adding a newline after stripping the prompt, as some models are more familiar with `"""\n` than with `"""`
- Configurable maximum evaluation process memory via the environment variable `EVALPLUS_MAX_MEMORY_BYTES` (see the sketch after this list)
- When the sampling size per task is > 1, the batch size is automatically set to `min(n_samples, 32)` if `--bs` is not set
- Sanitizer behavior: when the code is too broken to be sanitized, return the broken code rather than an empty string, for debuggability
- vLLM: automatic prefix caching is enabled to accelerate sampling (hopefully)
- Setting `top_p = 0.95` for the OpenAI, Google, and Anthropic backends
- New argument: `--trust-remote-code`
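As a quick reference, here is a minimal sketch of the memory cap and batch-size defaults mentioned above (the 4 GiB value and the `default_batch_size` helper are illustrative assumptions, not EvalPlus internals):

```python
import os

# Assumed example value: cap each evaluation process at 4 GiB.
os.environ.setdefault("EVALPLUS_MAX_MEMORY_BYTES", str(4 * 1024**3))

# Batch-size default described above: fall back to min(n_samples, 32)
# when --bs is not given and more than one sample is drawn per task.
def default_batch_size(bs: int | None, n_samples: int) -> int:
    if bs is not None:
        return bs
    return min(n_samples, 32) if n_samples > 1 else 1

print(default_batch_size(None, 100))  # 32
```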
PyPI: https://pypi.org/project/evalplus/0.3.1/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.3.1/images/sha256-26b118098bef281fe8dfe999bf05f1d5b45374b4e6c00161ec0f30592aef4740
^In our COLM paper, we presented 121 tasks based on the February version of MBPP+ (`v0.1.0`), which at the time contained 399 MBPP+ tasks -- in MBPP+ `v0.2.0`, we removed some broken tasks (399 -> 378), leading to a slight cut in the number of EvalPerf tasks as well.
^We skipped/yanked the release of `v0.3.0` and directly released `v0.3.1` due to a broken dependency in `v0.3.0`.