Here are the script and requirements.txt for evaluating language models using WWB. This particular example considers models fine-tuned for grammatical error correction.
The script selects grammatically incorrect sentences from the jhu-clsp/jfleg dataset (see grammar_prompts.csv).
It then runs the reference FP32 model (e.g. "pszemraj/bart-base-grammar-synthesis") on these sentences to produce the reference outputs: the same sentences with the grammar errors corrected.
Finally, the target model under evaluation is scored against the outputs obtained from the reference model.
In this example, the target model is an NF4-quantized version of the same FP32 model.
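One common way to obtain an NF4 target model is the 4-bit loading support in transformers backed by bitsandbytes. The config fragment below is a sketch of that approach; whether this example quantizes exactly this way, and with which additional options, is an assumption.

```python
from transformers import BitsAndBytesConfig

# Assumption: the target model is quantized with bitsandbytes' 4-bit NF4
# support in transformers; the actual script may use different settings.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,          # store weights in 4-bit precision
    bnb_4bit_quant_type="nf4",  # NormalFloat4 quantization data type
)
# This config would then be passed as `quantization_config=nf4_config`
# to `AutoModelForSeq2SeqLM.from_pretrained(...)`.
```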
python -m venv env
source env/bin/activate
python -m pip install -U pip
python -m pip install -r requirements.txt
python eval_grammar_corrector.py
Expected output:
   similarity    FDT   SDT  FDT norm  SDT norm
0    0.973024  16.56  0.64  0.846626  0.033268
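As a rough intuition for the divergence columns (assuming FDT measures the position of the first token at which the target's output diverges from the reference's, and the "norm" variant divides by output length; these names are not defined in this file), a toy illustration:

```python
def first_divergent_token(ref_tokens, tgt_tokens):
    """Index of the first position where the two token sequences differ."""
    for i, (r, t) in enumerate(zip(ref_tokens, tgt_tokens)):
        if r != t:
            return i
    # No mismatch in the common prefix: diverge where the lengths differ.
    return min(len(ref_tokens), len(tgt_tokens))

ref = "He goes to the store every day .".split()
tgt = "He goes to the shop every day .".split()

fdt = first_divergent_token(ref, tgt)   # position 4 ("store" vs "shop")
fdt_norm = fdt / len(ref)               # normalized by reference length
print(fdt, fdt_norm)
```

The real WWB metrics are averaged over the whole prompt set, which is why the expected output above shows fractional values.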