Commit 6cc41d34, authored by Saibo-creator and committed by GitHub

Add JSONSchemaBench: A Benchmark for Evaluating Structured Output from LLMs (#2865)



* Add JSON schema benchmark

* Update lm_eval/tasks/jsonschema_bench/metrics.py

Thanks for catching this
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

* run pre-commit

* add description to task catalogue readme

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
parent 773dcd7f
@@ -69,6 +69,7 @@
| [ifeval](ifeval/README.md) | Interactive fiction evaluation tasks for narrative understanding and reasoning. | English |
| [inverse_scaling](inverse_scaling/README.md) | Multiple-choice tasks from the Inverse Scaling Prize, designed to find settings where larger language models perform worse. | English |
| [japanese_leaderboard](japanese_leaderboard/README.md) | Japanese language understanding tasks to benchmark model performance on various linguistic aspects. | Japanese |
| [jsonschema_bench](jsonschema_bench/README.md) | Evaluate the ability of LLMs to generate JSON objects that conform to a given JSON schema, including API, configuration files, and other structured data formats. | JSON |
| [kbl](kbl/README.md) | Korean Benchmark for Legal Language Understanding. | Korean |
| [kmmlu](kmmlu/README.md) | Knowledge-based multi-subject multiple choice questions for academic evaluation. | Korean |
| [kobest](kobest/README.md) | A collection of tasks designed to evaluate understanding in Korean language. | Korean |
......
# JSONSchema Bench
## Tasks
- `jsonschema_bench_easy`, corresponding to the `github_easy` split of the original paper
- `jsonschema_bench_medium`, corresponding to the `github_medium` split of the original paper
- `jsonschema_bench_hard`, corresponding to the `github_hard` split of the original paper
Use the `jsonschema_bench` tag to run all three tasks.
## Metrics
The JSONSchema Bench tasks are evaluated using the following two metrics:
- `json_validity`: This metric checks whether the generated output is valid JSON. It is a binary metric, where 1 indicates valid JSON and 0 indicates invalid JSON. We use the `json` package to check the validity of the generated output.
- `schema_compliance`: This metric checks whether the generated output complies with the provided JSON schema. It is also a binary metric, where 1 indicates compliance and 0 indicates non-compliance. We use the `jsonschema` package to check the compliance of the generated output with the provided JSON schema.
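As a rough illustration of the `json_validity` metric, the sketch below scores a single model output as 1 or 0 using only the standard-library `json` package. The helper name `is_valid_json` and the exact fence-stripping steps are illustrative assumptions modelled on this task's `metrics.py`, not a copy of it.

```python
import json


def is_valid_json(prediction: str) -> int:
    """Return 1 if the text parses as JSON, else 0 (hypothetical helper)."""
    # Models often wrap their answer in a Markdown code fence, so drop
    # surrounding whitespace, backticks, and an optional "json" language tag.
    text = prediction.strip().strip("`").strip()
    if text.startswith("json"):
        text = text[len("json"):]
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return 0
    return 1


print(is_valid_json('{"city": "Seattle"}'))           # well-formed object
print(is_valid_json('```json\n{"a": [1, 2]}\n```'))  # fenced output
print(is_valid_json('{"city": "Seattle"'))            # missing closing brace
```

Averaging these per-sample scores over the evaluated documents gives the reported `json_validity` value.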
## Dependencies
The JSONSchema Bench tasks require the `jsonschema` library to be installed. You can install it using pip:
```bash
pip install jsonschema\[format\]
```
The `format` extra is required to support the `format` keyword in JSON Schema, which is used in the tasks.
## Sequence Length
The `easy` task requires a context window of about 2K tokens, the `medium` task about 3K tokens, and the `hard` task about 10K tokens (the exact numbers vary with the tokenizer used, but 10K tokens is a good estimate for `hard`).
If you don't have enough memory to run the `hard` task, you can reduce the context window size (for the `hf` model, by setting `max_length` in `--model_args`), but this will truncate the schema and lead to lower performance.
## Usage
Here is an example of how to run 10 instances of the `jsonschema_bench_medium` task:
```bash
lm_eval \
--model hf --gen_kwargs max_new_tokens=1024 \
--model_args pretrained=meta-llama/Llama-3.2-1B-Instruct,parallelize=True \
--tasks jsonschema_bench_medium \
--batch_size auto \
--limit 10 \
--apply_chat_template \
--fewshot_as_multiturn
```
The expected result is:
```
| Tasks |Version|Filter|n-shot| Metric | |Value| |Stderr|
|-----------------------|------:|------|-----:|-----------------|---|----:|---|-----:|
|jsonschema_bench_medium| 0.1|none | 2|json_validity |↑ | 1.0|± |0.0000|
| | |none | 2|schema_compliance|↑ | 0.2|± |0.1333|
```
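The `Value` and `Stderr` columns for `schema_compliance` can be reproduced by hand. The sketch below assumes that 2 of the 10 sampled outputs passed the schema check and that the standard error is the sample standard deviation divided by the square root of the sample count; both are assumptions inferred from the table above rather than taken from the harness source.

```python
import math
import statistics

# Assumed per-sample scores: 2 compliant outputs out of --limit 10.
scores = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

mean = statistics.mean(scores)
# Standard error of the mean, using the sample (n - 1) standard deviation.
stderr = statistics.stdev(scores) / math.sqrt(len(scores))

print(f"{mean:.1f} ± {stderr:.4f}")  # 0.2 ± 0.1333
```

With only 10 samples the uncertainty is large relative to the score, so drop `--limit` for any comparison between models.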
## Dataset
The dataset is available on the [Hugging Face Hub](https://huggingface.co/datasets/epfl-dlab/JSONSchemaBench).
## Leaderboard
We provide a [leaderboard](https://github.com/epfl-dlab/jsonschemabench-leaderboard) to track the progress of LLMs on the JSONSchema Bench tasks.
We welcome contributions to the leaderboard via pull requests.
## Paper
[Generating Structured Outputs from Language Models: Benchmark and Studies](https://arxiv.org/abs/2501.10868)
Homepage: https://github.com/guidance-ai/jsonschemabench
## Citation
```
@misc{geng2025jsonschemabench,
title={Generating Structured Outputs from Language Models: Benchmark and Studies},
author={Saibo Geng and Hudson Cooper and Michał Moskal and Samuel Jenkins and Julian Berman and Nathan Ranchin and Robert West and Eric Horvitz and Harsha Nori},
year={2025},
eprint={2501.10868},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.10868},
}
```
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
tag:
- jsonschema_bench
task: jsonschema_bench_easy
dataset_path: epfl-dlab/JSONSchemaBench
dataset_name: Github_easy
training_split: train
validation_split: valid
test_split: test
description: "Generate a JSON object that matches the following JSON schema."
doc_to_text: 'JSON schema: {{json_schema}}\n\nJSON object: '
doc_to_target: '{{json_object if json_object is defined else json_schema}}' # here we use the json_schema as the target at test time for evaluation
output_type: generate_until
metric_list:
  - metric: !function metrics.json_validity
    aggregation: mean
    higher_is_better: True
  - metric: !function metrics.schema_compliance
    aggregation: mean
    higher_is_better: True
metadata:
  version: 0.1
fewshot_split: null
num_fewshot: 2
fewshot_config:
sampler: first_n
samples:
- json_schema: "{
\"$schema\": \"http://json-schema.org/draft-04/schema#\",
\"definitions\": {
\"address1\": {
\"type\": \"string\"
},
\"address2\": {
\"type\": \"string\"
},
\"city\": {
\"type\": \"string\"
},
\"country\": {
\"type\": \"string\"
},
\"postalCode\": {
\"type\": \"string\"
},
\"state\": {
\"type\": \"string\"
}
},
\"description\": \"A simple address schema\",
\"properties\": {
\"address1\": {
\"$ref\": \"#/definitions/address1\"
},
\"address2\": {
\"$ref\": \"#/definitions/address2\"
},
\"city\": {
\"$ref\": \"#/definitions/city\"
},
\"country\": {
\"$ref\": \"#/definitions/country\"
},
\"postalCode\": {
\"$ref\": \"#/definitions/postalCode\"
},
\"state\": {
\"$ref\": \"#/definitions/state\"
}
},
\"type\": \"object\"
}"
json_object: "{
\"address1\": \"123 Main Street\",
\"address2\": \"Apt 4B\",
\"city\": \"Seattle\",
\"country\": \"USA\",
\"postalCode\": \"98101\",
\"state\": \"WA\"
}"
- json_schema: "{
\"$schema\": \"http://json-schema.org/draft-06/schema#\",
\"definitions\": {
\"ElementType\": {
\"enum\": [
\"component\",
\"directive\"
],
\"type\": \"string\"
},
\"SelectorChange\": {
\"properties\": {
\"remove\": {
\"description\": \"Remove directive/component\",
\"type\": \"boolean\"
},
\"replaceWith\": {
\"description\": \"Replace original selector with new one\",
\"type\": \"string\"
},
\"selector\": {
\"description\": \"Original selector to apply change to\",
\"type\": \"string\"
},
\"type\": {
\"$ref\": \"#/definitions/ElementType\",
\"description\": \"Type of selector the change applies to - either component or directive\"
}
},
\"required\": [
\"selector\",
\"type\"
],
\"type\": \"object\"
}
},
\"properties\": {
\"changes\": {
\"description\": \"An array of changes to component/directive selectors\",
\"items\": {
\"$ref\": \"#/definitions/SelectorChange\"
},
\"type\": \"array\"
}
},
\"required\": [
\"changes\"
],
\"type\": \"object\"
}"
json_object: "{
\"changes\": [
{
\"selector\": \"app-root\",
\"type\": \"component\",
\"remove\": false,
\"replaceWith\": \"new-root\"
},
{
\"selector\": \"my-directive\",
\"type\": \"directive\",
\"remove\": true,
\"replaceWith\": \"new-directive\"
}
]
}"
- json_schema: "{
\"additionalProperties\": false,
\"description\": \"Schema for tracking e-commerce transaction details and metadata.\",
\"properties\": {
\"store\": {
\"description\": \"The store or seller associated with the transaction.\",
\"maxLength\": 500,
\"type\": \"string\"
},
\"discountCode\": {
\"description\": \"Promotional code applied to the transaction.\",
\"maxLength\": 500,
\"type\": \"string\"
},
\"currencyCode\": {
\"description\": \"ISO 4217 currency code for the transaction.\",
\"maxLength\": 3,
\"minLength\": 3,
\"type\": \"string\"
},
\"transactionId\": {
\"description\": \"Unique identifier for the transaction.\",
\"maxLength\": 500,
\"type\": \"string\"
},
\"productList\": {
\"description\": \"Identifier for the product list from which the purchase was made.\",
\"maxLength\": 500,
\"type\": \"string\"
},
\"purchaseOption\": {
\"description\": \"Additional purchase options or preferences.\",
\"maxLength\": 500,
\"type\": \"string\"
},
\"totalAmount\": {
\"description\": \"Total revenue generated from the transaction.\",
\"multipleOf\": 0.01,
\"type\": \"number\"
},
\"deliveryCharge\": {
\"description\": \"Shipping cost associated with the order.\",
\"multipleOf\": 0.01,
\"type\": \"number\"
},
\"processStep\": {
\"description\": \"Current step in the purchase or checkout process.\",
\"maximum\": 2147483647,
\"minimum\": 0,
\"type\": \"integer\"
},
\"taxAmount\": {
\"description\": \"Total tax applied to the transaction.\",
\"multipleOf\": 0.01,
\"type\": \"number\"
}
},
\"self\": {
\"format\": \"jsonschema\",
\"name\": \"transactionDataObject\",
\"vendor\": \"com.ecommerce.analytics.tracking\",
\"version\": \"1-0-0\"
},
\"type\": \"object\"
}"
json_object: "{
\"store\": \"Corner Books\",
\"currencyCode\": \"EUR\",
\"transactionId\": \"TXN987654321\",
\"processStep\": 1
}"
- json_schema: "{
\"$comment\": \"https://minecraft.fandom.com/wiki/Data_Pack\",
\"$id\": \"https://json.schemastore.org/minecraft-damage-type.json\",
\"$schema\": \"http://json-schema.org/draft-07/schema#\",
\"description\": \"A damage type's for a Minecraft data pack config schema\",
\"properties\": {
\"death_message_type\": {
\"enum\": [
\"default\",
\"fall_variants\",
\"intentional_game_design\"
],
\"type\": \"string\"
},
\"effects\": {
\"enum\": [
\"hurt\",
\"thorns\",
\"drowning\",
\"burning\",
\"poking\",
\"freezing\"
],
\"type\": \"string\"
},
\"exhaustion\": {
\"type\": \"number\"
},
\"message_id\": {
\"type\": \"string\"
},
\"scaling\": {
\"enum\": [
\"never\",
\"always\",
\"when_caused_by_living_non_player\"
],
\"type\": \"string\"
}
},
\"required\": [
\"message_id\",
\"scaling\",
\"exhaustion\"
],
\"title\": \"Minecraft Data Pack Damage Type\",
\"type\": \"object\"
}"
json_object: "{
\"message_id\": \"minecraft:damage.message\",
\"scaling\": \"always\",
\"exhaustion\": 0.3,
\"death_message_type\": \"default\",
\"effects\": \"hurt\"
}"
- json_schema: "{
\"additionalProperties\": false,
\"description\": \"Schema for tracking e-commerce transaction details and metadata.\",
\"properties\": {
\"store\": {
\"description\": \"The store or seller associated with the transaction.\",
\"maxLength\": 500,
\"type\": \"string\"
},
\"discountCode\": {
\"description\": \"Promotional code applied to the transaction.\",
\"maxLength\": 500,
\"type\": \"string\"
},
\"currencyCode\": {
\"description\": \"ISO 4217 currency code for the transaction.\",
\"maxLength\": 3,
\"minLength\": 3,
\"type\": \"string\"
},
\"transactionId\": {
\"description\": \"Unique identifier for the transaction.\",
\"maxLength\": 500,
\"type\": \"string\"
},
\"productList\": {
\"description\": \"Identifier for the product list from which the purchase was made.\",
\"maxLength\": 500,
\"type\": \"string\"
},
\"purchaseOption\": {
\"description\": \"Additional purchase options or preferences.\",
\"maxLength\": 500,
\"type\": \"string\"
},
\"totalAmount\": {
\"description\": \"Total revenue generated from the transaction.\",
\"multipleOf\": 0.01,
\"type\": \"number\"
},
\"deliveryCharge\": {
\"description\": \"Shipping cost associated with the order.\",
\"multipleOf\": 0.01,
\"type\": \"number\"
},
\"processStep\": {
\"description\": \"Current step in the purchase or checkout process.\",
\"maximum\": 2147483647,
\"minimum\": 0,
\"type\": \"integer\"
},
\"taxAmount\": {
\"description\": \"Total tax applied to the transaction.\",
\"multipleOf\": 0.01,
\"type\": \"number\"
}
},
\"self\": {
\"format\": \"jsonschema\",
\"name\": \"transactionDataObject\",
\"vendor\": \"com.ecommerce.analytics.tracking\",
\"version\": \"1-0-0\"
},
\"type\": \"object\"
}"
json_object: "{
\"store\": \"TechGadgets Online\",
\"discountCode\": \"SUMMER20\",
\"currencyCode\": \"USD\",
\"transactionId\": \"TXN123456789\",
\"productList\": \"Best Sellers\",
\"purchaseOption\": \"Express Shipping\",
\"totalAmount\": 299.99,
\"deliveryCharge\": 5.99,
\"processStep\": 3,
\"taxAmount\": 20.50
}"
- json_schema: "{
\"properties\": {
\"date\": {
\"description\": \"The date of the meeting\",
\"type\": \"string\"
},
\"time\": {
\"description\": \"The time of the meeting\",
\"type\": \"string\"
},
\"participants\": {
\"description\": \"List of participants' emails\",
\"type\": \"array\",
\"items\": {
\"type\": \"string\"
}
}
},
\"required\": [
\"date\",
\"time\"
],
\"type\": \"object\"
}"
json_object: "{
\"date\": \"2024-09-30\",
\"time\": \"10:00 AM\",
\"participants\": [
\"alice@example.com\",
\"bob@example.com\"
]
}"
include: jsonschema_bench_easy.yaml
task: jsonschema_bench_hard
dataset_name: Github_hard
include: jsonschema_bench_easy.yaml
task: jsonschema_bench_medium
dataset_name: Github_medium
import ipaddress
import json
import logging
import uuid
from typing import Any, Dict


# check if jsonschema is installed
try:
    import jsonschema
    from jsonschema import Draft202012Validator, FormatChecker, ValidationError
except ImportError as e:
    raise ImportError(
        "jsonschema is not installed. Please install it using 'pip install jsonschema[format]'"
    ) from e

eval_logger = logging.getLogger(__name__)

def is_json_schema_valid(schema: dict) -> bool:
    """
    Check if a JSON schema is valid.

    :param schema: A JSON schema.
    :return: True if the schema is valid, False otherwise.
    """
    try:
        # Check if the schema is valid
        jsonschema.Draft202012Validator.check_schema(schema)
        return True
    except jsonschema.SchemaError:
        return False
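
# Initialize the FormatChecker
format_checker = FormatChecker()


# Add custom format checkers. Each checker must return True on success:
# FormatChecker treats a falsy return value (including None) as a failure,
# and only converts the exceptions listed in `raises` into format errors.
@format_checker.checks("ipv4", raises=ipaddress.AddressValueError)
def ipv4_check(value) -> bool:
    ipaddress.IPv4Address(value)
    return True


@format_checker.checks("ipv6", raises=ipaddress.AddressValueError)
def ipv6_check(value) -> bool:
    ipaddress.IPv6Address(value)
    return True


@format_checker.checks("uuid", raises=ValueError)
def uuid_check(value) -> bool:
    uuid.UUID(value)
    return True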

def schema_conform_with_format_checker(
    instance: Dict[str, Any], schema: Dict[str, Any]
) -> bool:
    """
    Validate a JSON instance against a schema with enhanced format checking.

    :param instance: The JSON instance to validate.
    :param schema: The JSON schema to validate against.
    :return: True if the instance conforms to the schema.
    :raises ValidationError: If the schema is invalid or the validation fails.
    """
    # first check if the schema is valid
    if not is_json_schema_valid(schema):
        raise ValidationError("The JSON schema is invalid.")
    validator = Draft202012Validator(schema, format_checker=format_checker)
    try:
        validator.validate(instance)
    except ValidationError as e:
        raise ValidationError(e.message)
    return True

def schema_compliance(references: list[str], predictions: list[str]) -> bool:
    assert len(references) == 1, (
        "We only have one reference for this task, which is the JSON schema."
    )
    assert len(predictions) == 1, (
        "Currently, we don't support pass@k for JSON schema validation."
    )
    reference = references[0]
    prediction = predictions[0]  # the single generated string for this document
    json_schema = json.loads(reference.strip())
    try:
        # Drop a surrounding Markdown code fence (``` or ```json) before parsing
        json_obj = json.loads(prediction.strip().strip("```").strip("json"))
    except json.JSONDecodeError:
        return False
    try:
        schema_conform = schema_conform_with_format_checker(json_obj, json_schema)
    except Exception as e:
        eval_logger.error(f"Error: {e}")
        return False
    return schema_conform

def json_validity(references: list[str], predictions: list[str]) -> bool:
    assert len(predictions) == 1, (
        "Currently, we don't support pass@k for JSON schema validation."
    )
    prediction = predictions[0]  # the single generated string for this document
    try:
        # Drop a surrounding Markdown code fence (``` or ```json) before parsing
        json.loads(prediction.strip().strip("```").strip("json").strip())
    except json.JSONDecodeError:
        return False
    return True