Commit dd3b4f7a authored by Leo Gao

evaluator: take min bootstrap steps for bleu/chrf/ter

parent d67c77be
@@ -95,7 +95,7 @@ def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_iters
         # hotfix: bleu, chrf, ter seem to be really expensive to bootstrap
         # so we run them less iterations. still looking for a cleaner way to do this
-        stderr = lm_eval.metrics.stderr_for_metric(task.aggregation()[metric], bootstrap_iters=1000 if metric in ["bleu", "chrf", "ter"] else bootstrap_iters)
+        stderr = lm_eval.metrics.stderr_for_metric(task.aggregation()[metric], bootstrap_iters=min(bootstrap_iters, 1000) if metric in ["bleu", "chrf", "ter"] else bootstrap_iters)
         if stderr is not None:
             results[task_name][metric + "_stderr"] = stderr(items)
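For context, the stderr here comes from bootstrap resampling: the metric is recomputed over bootstrap_iters resampled copies of the per-example results, so cost grows linearly with the iteration count, and corpus-level metrics like BLEU/chrF/TER pay their full recomputation cost on every iteration. The sketch below illustrates that idea only; it is not the actual lm_eval.metrics.stderr_for_metric implementation, and bootstrap_stderr is a hypothetical helper.

    import random
    import statistics

    def bootstrap_stderr(metric_fn, items, bootstrap_iters):
        # Hypothetical sketch: recompute the metric on `bootstrap_iters`
        # resamples (with replacement) of the per-example items; the spread
        # of those estimates approximates the standard error. Each iteration
        # calls metric_fn on a full resample, so expensive metrics such as
        # corpus BLEU pay their full cost every iteration.
        estimates = []
        for _ in range(bootstrap_iters):
            resample = random.choices(items, k=len(items))
            estimates.append(metric_fn(resample))
        return statistics.stdev(estimates)

    # Usage mirroring the hotfix above (names assumed for illustration):
    # cap iterations for metrics that are costly to recompute, otherwise
    # keep the caller-supplied value.
    # expensive = {"bleu", "chrf", "ter"}
    # iters = min(bootstrap_iters, 1000) if metric in expensive else bootstrap_iters
    # stderr = bootstrap_stderr(task.aggregation()[metric], items, iters)

The change itself only swaps the hard-coded 1000 for min(bootstrap_iters, 1000), so a caller who asks for fewer than 1000 iterations is no longer silently bumped up for these three metrics.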