git clone https://github.com/bigcode-project/bigcode-evaluation-harness
cd bigcode-evaluation-harness
pip install -e .
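the eval commands below go through accelerate launch, so configure accelerate once up front; the non-interactive default config is enough for a single-machine run:
accelerate config default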
greedy decoding with mistralai/Mistral-7B-v0.1 should result in "pass@1": 0.29878
(the paper reports 30.5%, a ~0.6% gap)
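the invocations below assume a few shell variables; a minimal sketch (paths are placeholders, pick your own):
WORKDIR=./bigcode-evaluation-harness   # where the cloned harness (and its main.py) lives
LMID=mistralai/Mistral-7B-v0.1         # HF model id to evaluate
OUTDIR=./out; mkdir -p $OUTDIR         # metrics and generations land here
BS=1                                   # batch size; must stay 1 for the greedy run (see note below)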
accelerate launch $WORKDIR/main.py \
--model $LMID \
--max_length_generation 512 \
--tasks humaneval \
--batch_size $BS \
--n_samples 1 \
--no_do_sample \
--temperature 0.0 \
--top_p 1.0 \
--precision bf16 \
--allow_code_execution \
--use_auth_token \
--metric_output_path $OUTDIR/evaluation_results.json \
--save_generations \
--save_generations_path $OUTDIR/generated.json \
# --limit 5
# unsure how to disable top-p directly; it shouldn't matter here, since sampling
# parameters like top-p are ignored once --no_do_sample forces greedy decoding
# with --no_do_sample, --n_samples > 1 would just produce n identical completions
# (greedy decoding is deterministic), so set it to 1; beam search might behave
# differently - need to find out
# BS (batch size) must be 1 for the greedy case per the current implementation
mistralai/Mistral-7B-v0.1 with sampling (n_samples=50, temperature=0.2; the run below):
"pass@1": 0.2825609756097561,
"pass@10": 0.41052352768889244
accelerate launch $WORKDIR/main.py \
--model $LMID \
--max_length_generation 512 \
--tasks humaneval \
--batch_size $BS \
--n_samples 50 \
--do_sample \
--temperature 0.2 \
--top_p 0.95 \
--precision bf16 \
--allow_code_execution \
--use_auth_token \
--metric_output_path $OUTDIR/evaluation_results.json \
--save_generations \
--save_generations_path $OUTDIR/generated.json \
# --limit 5
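once a run finishes, a quick way to pull the scores out of the metrics file (a sketch assuming jq is installed; results in the output JSON are keyed by task name):
jq '.humaneval' $OUTDIR/evaluation_results.json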
- open codegen leaderboard: https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
- great discussion on Mistral-7B HumanEval reproducibility: bigcode-project/bigcode-evaluation-harness#165
- quick read on the pass@k metric: https://deepgram.com/learn/humaneval-llm-benchmark (estimator sketched below)
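for reference, pass@k is the unbiased estimator from the HumanEval/Codex paper: with n generations per problem (50 in the sampling run above) of which c pass the unit tests,
pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems
i.e. one minus the probability that a random size-k subset of the n generations contains no passing sample. For the greedy run (n = 1, k = 1) this reduces to the plain fraction of problems solved.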