tosh/pandafish-2-7b-32k-Nous.md

Created April 5, 2024 10:49


| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---:|---:|---:|---:|---:|
| pandafish-2-7b-32k | 40.8 | 73.35 | 57.46 | 42.69 | 53.57 |

AGIEval

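| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| agieval_aqua_rat | 0 | acc | 22.05 | ± | 2.61 |
| | | acc_norm | 19.69 | ± | 2.50 |
| agieval_logiqa_en | 0 | acc | 35.94 | ± | 1.88 |
| | | acc_norm | 37.02 | ± | 1.89 |
| agieval_lsat_ar | 0 | acc | 22.61 | ± | 2.76 |
| | | acc_norm | 24.35 | ± | 2.84 |
| agieval_lsat_lr | 0 | acc | 40.00 | ± | 2.17 |
| | | acc_norm | 40.98 | ± | 2.18 |
| agieval_lsat_rc | 0 | acc | 56.88 | ± | 3.03 |
| | | acc_norm | 55.39 | ± | 3.04 |
| agieval_sat_en | 0 | acc | 72.82 | ± | 3.11 |
| | | acc_norm | 71.84 | ± | 3.14 |
| agieval_sat_en_without_passage | 0 | acc | 45.63 | ± | 3.48 |
| | | acc_norm | 40.78 | ± | 3.43 |
| agieval_sat_math | 0 | acc | 40.91 | ± | 3.32 |
| | | acc_norm | 36.36 | ± | 3.25 |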

Average: 40.8%
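The gist does not state how this average is formed. A minimal sketch, assuming it is the unweighted mean of the eight `acc_norm` values in the AGIEval table (an assumption that happens to reproduce the reported figure); the variable names are illustrative:

```python
# acc_norm values copied from the AGIEval table.
# Assumption: the suite average is the plain mean of acc_norm per task.
agieval_acc_norm = {
    "agieval_aqua_rat": 19.69,
    "agieval_logiqa_en": 37.02,
    "agieval_lsat_ar": 24.35,
    "agieval_lsat_lr": 40.98,
    "agieval_lsat_rc": 55.39,
    "agieval_sat_en": 71.84,
    "agieval_sat_en_without_passage": 40.78,
    "agieval_sat_math": 36.36,
}

agieval_average = sum(agieval_acc_norm.values()) / len(agieval_acc_norm)
print(f"AGIEval average: {agieval_average:.2f}")
```

Averaging the `acc` column instead gives a different number, so the match suggests `acc_norm` is the metric behind the reported average.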

GPT4All

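| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| arc_challenge | 0 | acc | 57.25 | ± | 1.45 |
| | | acc_norm | 58.53 | ± | 1.44 |
| arc_easy | 0 | acc | 84.22 | ± | 0.75 |
| | | acc_norm | 81.82 | ± | 0.79 |
| boolq | 1 | acc | 86.36 | ± | 0.60 |
| hellaswag | 0 | acc | 64.32 | ± | 0.48 |
| | | acc_norm | 82.95 | ± | 0.38 |
| openbookqa | 0 | acc | 36.00 | ± | 2.15 |
| | | acc_norm | 46.20 | ± | 2.23 |
| piqa | 0 | acc | 81.99 | ± | 0.90 |
| | | acc_norm | 83.30 | ± | 0.87 |
| winogrande | 0 | acc | 74.27 | ± | 1.23 |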

Average: 73.35%
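Two GPT4All tasks (`boolq`, `winogrande`) report only `acc`, so the averaging rule here is less obvious. A hedged sketch, assuming the average uses `acc_norm` where a task reports it and falls back to `acc` otherwise; this rule is inferred from the numbers, not stated in the gist:

```python
# Values copied from the GPT4All table.
# Assumption: use acc_norm when present, otherwise acc.
gpt4all = {
    "arc_challenge": {"acc": 57.25, "acc_norm": 58.53},
    "arc_easy": {"acc": 84.22, "acc_norm": 81.82},
    "boolq": {"acc": 86.36},
    "hellaswag": {"acc": 64.32, "acc_norm": 82.95},
    "openbookqa": {"acc": 36.00, "acc_norm": 46.20},
    "piqa": {"acc": 81.99, "acc_norm": 83.30},
    "winogrande": {"acc": 74.27},
}

# Pick one score per task, preferring acc_norm.
scores = [metrics.get("acc_norm", metrics["acc"]) for metrics in gpt4all.values()]
gpt4all_average = sum(scores) / len(scores)
print(f"GPT4All average: {gpt4all_average:.2f}")
```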

TruthfulQA

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| truthfulqa_mc | 1 | mc1 | 41.13 | ± | 1.72 |
| | | mc2 | 57.46 | ± | 1.52 |

Average: 57.46%

Bigbench

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 52.63 | ± | 3.63 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 69.38 | ± | 2.40 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 45.35 | ± | 3.11 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 20.06 | ± | 2.12 |
| | | exact_str_match | 11.14 | ± | 1.66 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 28.80 | ± | 2.03 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 20.57 | ± | 1.53 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 47.00 | ± | 2.89 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 33.80 | ± | 2.12 |
| bigbench_navigate | 0 | multiple_choice_grade | 51.80 | ± | 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 64.75 | ± | 1.07 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 47.10 | ± | 2.36 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 27.05 | ± | 1.41 |
| bigbench_snarks | 0 | multiple_choice_grade | 67.96 | ± | 3.48 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 63.49 | ± | 1.53 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 41.00 | ± | 1.56 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 23.84 | ± | 1.21 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 16.80 | ± | 0.89 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 47.00 | ± | 2.89 |

Average: 42.69%

Average score: 53.57%
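The overall score appears to be the unweighted mean of the four suite averages. A short sketch of that arithmetic, using the rounded suite figures reported in this gist (the dictionary and variable names are illustrative):

```python
# Suite averages as reported in this gist (already rounded).
suite_averages = {
    "AGIEval": 40.8,
    "GPT4All": 73.35,
    "TruthfulQA": 57.46,
    "Bigbench": 42.69,
}

# Assumption: every suite carries equal weight in the overall score.
overall = sum(suite_averages.values()) / len(suite_averages)
print(f"Overall mean of suite averages: {overall:.3f}")
```

The mean of the rounded figures sits at about 53.575, right on a rounding boundary; the reported 53.57 is consistent with the average having been taken over unrounded suite scores.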

Elapsed time: 02:27:40
