What is this?
In addition to the initial version of the EvalPlus source code, we release the pre-generated code of LLMs on HumanEval+ (also applicable to the base HumanEval) and the regularized ground-truth solutions. We hope these accelerate future research: researchers can reuse our pre-generated code instead of generating it from scratch.
- `${MODEL_NAME}_temp_${TEMPERATURE}.zip`: LLM-produced program samples
- `HumanEvalPlusGT.zip`: the re-implemented ground-truth solutions
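The archive naming scheme above can be resolved programmatically. A minimal sketch, where the model names and temperature values are illustrative assumptions rather than the exact set of released files:

```python
def archive_name(model_name: str, temperature: float) -> str:
    """Build a filename following ${MODEL_NAME}_temp_${TEMPERATURE}.zip."""
    return f"{model_name}_temp_{temperature}.zip"

# Hypothetical examples; the actual release may use different model names.
print(archive_name("codegen-6b", 0.0))
print(archive_name("codegen-6b", 0.8))
```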
Data sources
The configuration of the pre-generated code follows our pre-print paper: https://arxiv.org/abs/2305.01210
- We evaluated:
- 14 models (10 model types)
- 5 temperature settings: zero temperature (for greedy decoding) plus {0.2, 0.4, 0.6, 0.8}
- 200 code samples per model for each random-sampling (i.e., non-greedy) setting
- We use nucleus sampling with top-p = 0.95 for all HuggingFace-based models
- CodeGen-6B and CodeGen-16B are accelerated by FauxPilot (thanks!)
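The sampling grid above can be sketched as follows. Only the temperatures, the top-p value, and the 200-sample count come from the list; the configuration-dict layout, and the assumption that greedy decoding yields a single sample, are illustrative:

```python
# Sampling grid from the description above.
TEMPERATURES = [0.0, 0.2, 0.4, 0.6, 0.8]   # 0.0 => greedy decoding
TOP_P = 0.95                                # nucleus sampling for HF models
SAMPLES_PER_RANDOM_SETTING = 200

def sampling_configs():
    """Yield one decoding configuration per temperature setting."""
    for t in TEMPERATURES:
        greedy = (t == 0.0)
        yield {
            "temperature": t,
            "do_sample": not greedy,                 # no sampling when greedy
            "top_p": None if greedy else TOP_P,
            # One sample for greedy decoding is an assumption for illustration.
            "num_samples": 1 if greedy else SAMPLES_PER_RANDOM_SETTING,
        }

configs = list(sampling_configs())
print(len(configs))  # 5 temperature settings
```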
Evaluation results
We compute the results from the samples using the test cases of both the base HumanEval and our enhanced HumanEval+:
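Results of this kind are typically reported as pass@k. As one possible way to reproduce such numbers from the released samples, here is a sketch of the standard unbiased pass@k estimator (from the Codex paper); it is illustrative and may differ from the exact evaluation script:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: of n generated samples, c are correct.

    Returns the probability that at least one of k samples drawn
    without replacement passes all tests.
    """
    if n - c < k:
        return 1.0  # too few failures for a fully failing draw
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 200 samples, 50 passing => pass@1 = 0.25
print(pass_at_k(200, 50, 1))
```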
Call for contribution
We also encourage open-source developers to contribute to LLM4Code research by: (i) reproducing and validating our results; (ii) uploading LLM-generated samples to reproduce the results of new models; and, of course, (iii) trying out our enhanced dataset to obtain more accurate and trustworthy results!