Add JSONSchemaBench: A Benchmark for Evaluating Structured Output from LLMs (#2865)

* Add JSON schema benchmark
* Update lm_eval/tasks/jsonschema_bench/metrics.py

  Thanks for catching this

  Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
* run pre-commit
* add description to task catalogue readme

---------

Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

Unverified Commit 6cc41d34 authored Apr 02, 2025 by Saibo-creator Committed by GitHub Apr 02, 2025
6 changed files
--- a/lm_eval/tasks/README.md
+++ b/lm_eval/tasks/README.md
@@ -69,6 +69,7 @@
 | [ifeval](ifeval/README.md)                                               | Interactive fiction evaluation tasks for narrative understanding and reasoning.                                                                                                                                                                                                                                                        | English                                                                                                               |
 | [inverse_scaling](inverse_scaling/README.md)                             | Multiple-choice tasks from the Inverse Scaling Prize, designed to find settings where larger language models perform worse.                                                                                                                                                                                                            | English                                                                                                               |
 | [japanese_leaderboard](japanese_leaderboard/README.md)                   | Japanese language understanding tasks to benchmark model performance on various linguistic aspects.                                                                                                                                                                                                                                    | Japanese                                                                                                              |
+| [jsonschema_bench](jsonschema_bench/README.md)                           | Evaluate the ability of LLMs to generate JSON objects that conform to a given JSON schema, including API, configuration files, and other structured data formats. | JSON                                                                                                                       |
 | [kbl](kbl/README.md)                                                     | Korean Benchmark for Legal Language Understanding.                                                                                                                                                                                                                                                                                     | Korean                                                                                                                |
 | [kmmlu](kmmlu/README.md)                                                 | Knowledge-based multi-subject multiple choice questions for academic evaluation.                                                                                                                                                                                                                                                       | Korean                                                                                                                |
 | [kobest](kobest/README.md)                                               | A collection of tasks designed to evaluate understanding in Korean language.                                                                                                                                                                                                                                                           | Korean                                                                                                                |

--- a/lm_eval/tasks/jsonschema_bench/README.md
+++ b/lm_eval/tasks/jsonschema_bench/README.md
+# JSONSchema Bench
+## Tasks
+- `jsonschema_bench_easy`, corresponding to the `github_easy` split of the original paper
+- `jsonschema_bench_medium`, corresponding to the `github_medium` split of the original paper
+- `jsonschema_bench_hard`, corresponding to the `github_hard` split of the original paper
+Use the `jsonschema_bench` tag to run all three tasks.
+## Metrics
+The JSONSchema Bench tasks are evaluated using the following two metrics:
+- `json_validity`: This metric checks whether the generated output is valid JSON. It is a binary metric, where 1 indicates valid JSON and 0 indicates invalid JSON. We use Python's built-in `json` package to check the validity of the generated output.
+- `schema_compliance`: This metric checks whether the generated output complies with the provided JSON schema. It is also a binary metric, where 1 indicates compliance and 0 indicates non-compliance. We use the `jsonschema` package to check the compliance of the generated output with the provided JSON schema.
+## Dependencies
+The JSONSchema Bench tasks require the `jsonschema` library to be installed. You can install it using pip:
+```bash
+pip install jsonschema\[format\]
+```
+The `format` extra is required to support the `format` keyword in JSON Schema, which is used in the tasks.
+## Sequence Length
+The `easy` task requires a context window of 2K tokens, the `medium` task requires 3K tokens, and the `hard` task requires 10K tokens (the exact number will vary depending on the tokenizer used, but 10K tokens is a good estimate).
+If you don't have enough memory to run the `hard` task, you can use the `--max_length` flag to reduce the context window size, but this will truncate the schema and lead to lower performance.
+## Usage
+Here is an example of how to run 10 instances of the `jsonschema_bench_medium` task:
+```bash
+lm_eval \
+    --model hf --gen_kwargs max_new_tokens=1024 \
+    --model_args pretrained=meta-llama/Llama-3.2-1B-Instruct,parallelize=True \
+    --tasks jsonschema_bench_medium \
+    --batch_size auto \
+    --limit 10 \
+    --apply_chat_template \
+    --fewshot_as_multiturn
+```
+The expected result is:
+```
+|         Tasks         |Version|Filter|n-shot|     Metric      |   |Value|   |Stderr|
+|-----------------------|------:|------|-----:|-----------------|---|----:|---|-----:|
+|jsonschema_bench_medium|    0.1|none  |     2|json_validity    |↑  |  1.0|±  |0.0000|
+|                       |       |none  |     2|schema_compliance|↑  |  0.2|±  |0.1333|
+```
+## Dataset
+Available at [HF hub](https://huggingface.co/datasets/epfl-dlab/JSONSchemaBench)
+## Leaderboard
+We provide a [leaderboard](https://github.com/epfl-dlab/jsonschemabench-leaderboard) to track the progress of LLMs on the JSONSchema Bench tasks.
+We welcome contributions to the leaderboard via pull requests.
+## Paper
+[Generating Structured Outputs from Language Models: Benchmark and Studies](https://arxiv.org/abs/2501.10868)
+Homepage: https://github.com/guidance-ai/jsonschemabench
+## Citation
+```
+@misc{geng2025jsonschemabench,
+      title={Generating Structured Outputs from Language Models: Benchmark and Studies},
+      author={Saibo Geng and Hudson Cooper and Michał Moskal and Samuel Jenkins and Julian Berman and Nathan Ranchin and Robert West and Eric Horvitz and Harsha Nori},
+      year={2025},
+      eprint={2501.10868},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2501.10868},
+}
+```
+### Checklist
+For adding novel benchmarks/datasets to the library:
+* [x] Is the task an existing benchmark in the literature?
+  * [x] Have you referenced the original paper that introduced the task?
+  * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
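A note on the expected-results table in the README above: both metrics are binary per instance and aggregated with `mean`, so with `--limit 10` a `schema_compliance` of 0.2 corresponds to 2 compliant outputs out of 10. The sketch below reproduces the reported value and standard error under the assumption that the standard error is the sample standard deviation divided by the square root of the sample size; the helper name `mean_and_stderr` and the score list are hypothetical and not part of this commit.

```python
import math


def mean_and_stderr(scores: list[int]) -> tuple[float, float]:
    """Mean of binary scores and the standard error of that mean."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance (n - 1 denominator), then standard error of the mean.
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)


# Hypothetical per-instance scores: 10 instances, 2 of them schema-compliant.
schema_compliance_scores = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
mean, stderr = mean_and_stderr(schema_compliance_scores)
print(f"{mean:.1f} ± {stderr:.4f}")  # prints "0.2 ± 0.1333"
```

With only 10 instances the standard error is large, so a small `--limit` run is best read as a smoke test rather than a score to compare across models.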
--- a/lm_eval/tasks/jsonschema_bench/jsonschema_bench_easy.yaml
+++ b/lm_eval/tasks/jsonschema_bench/jsonschema_bench_easy.yaml
+tag:
+  - jsonschema_bench
+task: jsonschema_bench_easy
+dataset_path: epfl-dlab/JSONSchemaBench
+dataset_name: Github_easy
+training_split: train
+validation_split: valid
+test_split: test
+description: "Generate a JSON object that matches the following JSON schema."
+doc_to_text: 'JSON schema: {{json_schema}}\n\nJSON object: '
+doc_to_target: '{{json_object if json_object is defined else json_schema}}' # here we use the json_schema as the target at test time for evaluation
+output_type: generate_until
+metric_list:
+  - metric: !function metrics.json_validity
+    aggregation: mean
+    higher_is_better: True
+  - metric: !function metrics.schema_compliance
+    aggregation: mean
+    higher_is_better: True
+metadata:
+  version: 0.1
+fewshot_split: null
+num_fewshot: 2
+fewshot_config:
+  sampler: first_n
+  samples:
+  - json_schema: "{
+      \"$schema\": \"http://json-schema.org/draft-04/schema#\",
+      \"definitions\": {
+          \"address1\": {
+              \"type\": \"string\"
+          },
+          \"address2\": {
+              \"type\": \"string\"
+          },
+          \"city\": {
+              \"type\": \"string\"
+          },
+          \"country\": {
+              \"type\": \"string\"
+          },
+          \"postalCode\": {
+              \"type\": \"string\"
+          },
+          \"state\": {
+              \"type\": \"string\"
+          }
+      },
+      \"description\": \"A simple address schema\",
+      \"properties\": {
+          \"address1\": {
+              \"$ref\": \"#/definitions/address1\"
+          },
+          \"address2\": {
+              \"$ref\": \"#/definitions/address2\"
+          },
+          \"city\": {
+              \"$ref\": \"#/definitions/city\"
+          },
+          \"country\": {
+              \"$ref\": \"#/definitions/country\"
+          },
+          \"postalCode\": {
+              \"$ref\": \"#/definitions/postalCode\"
+          },
+          \"state\": {
+              \"$ref\": \"#/definitions/state\"
+          }
+      },
+      \"type\": \"object\"
+    }"
+    json_object: "{
+      \"address1\": \"123 Main Street\",
+      \"address2\": \"Apt 4B\",
+      \"city\": \"Seattle\",
+      \"country\": \"USA\",
+      \"postalCode\": \"98101\",
+      \"state\": \"WA\"
+    }"
+  - json_schema: "{
+      \"$schema\": \"http://json-schema.org/draft-06/schema#\",
+      \"definitions\": {
+          \"ElementType\": {
+              \"enum\": [
+                  \"component\",
+                  \"directive\"
+              ],
+              \"type\": \"string\"
+          },
+          \"SelectorChange\": {
+              \"properties\": {
+                  \"remove\": {
+                      \"description\": \"Remove directive/component\",
+                      \"type\": \"boolean\"
+                  },
+                  \"replaceWith\": {
+                      \"description\": \"Replace original selector with new one\",
+                      \"type\": \"string\"
+                  },
+                  \"selector\": {
+                      \"description\": \"Original selector to apply change to\",
+                      \"type\": \"string\"
+                  },
+                  \"type\": {
+                      \"$ref\": \"#/definitions/ElementType\",
+                      \"description\": \"Type of selector the change applies to - either component or directive\"
+                  }
+              },
+              \"required\": [
+                  \"selector\",
+                  \"type\"
+              ],
+              \"type\": \"object\"
+          }
+      },
+      \"properties\": {
+          \"changes\": {
+              \"description\": \"An array of changes to component/directive selectors\",
+              \"items\": {
+                  \"$ref\": \"#/definitions/SelectorChange\"
+              },
+              \"type\": \"array\"
+          }
+      },
+      \"required\": [
+          \"changes\"
+      ],
+      \"type\": \"object\"
+    }"
+    json_object: "{
+      \"changes\": [
+          {
+              \"selector\": \"app-root\",
+              \"type\": \"component\",
+              \"remove\": false,
+              \"replaceWith\": \"new-root\"
+          },
+          {
+              \"selector\": \"my-directive\",
+              \"type\": \"directive\",
+              \"remove\": true,
+              \"replaceWith\": \"new-directive\"
+          }
+      ]
+    }"
+  - json_schema: "{
+      \"additionalProperties\": false,
+      \"description\": \"Schema for tracking e-commerce transaction details and metadata.\",
+      \"properties\": {
+        \"store\": {
+          \"description\": \"The store or seller associated with the transaction.\",
+          \"maxLength\": 500,
+          \"type\": \"string\"
+        },
+        \"discountCode\": {
+          \"description\": \"Promotional code applied to the transaction.\",
+          \"maxLength\": 500,
+          \"type\": \"string\"
+        },
+        \"currencyCode\": {
+          \"description\": \"ISO 4217 currency code for the transaction.\",
+          \"maxLength\": 3,
+          \"minLength\": 3,
+          \"type\": \"string\"
+        },
+        \"transactionId\": {
+          \"description\": \"Unique identifier for the transaction.\",
+          \"maxLength\": 500,
+          \"type\": \"string\"
+        },
+        \"productList\": {
+          \"description\": \"Identifier for the product list from which the purchase was made.\",
+          \"maxLength\": 500,
+          \"type\": \"string\"
+        },
+        \"purchaseOption\": {
+          \"description\": \"Additional purchase options or preferences.\",
+          \"maxLength\": 500,
+          \"type\": \"string\"
+        },
+        \"totalAmount\": {
+          \"description\": \"Total revenue generated from the transaction.\",
+          \"multipleOf\": 0.01,
+          \"type\": \"number\"
+        },
+        \"deliveryCharge\": {
+          \"description\": \"Shipping cost associated with the order.\",
+          \"multipleOf\": 0.01,
+          \"type\": \"number\"
+        },
+        \"processStep\": {
+          \"description\": \"Current step in the purchase or checkout process.\",
+          \"maximum\": 2147483647,
+          \"minimum\": 0,
+          \"type\": \"integer\"
+        },
+        \"taxAmount\": {
+          \"description\": \"Total tax applied to the transaction.\",
+          \"multipleOf\": 0.01,
+          \"type\": \"number\"
+        }
+      },
+      \"self\": {
+        \"format\": \"jsonschema\",
+        \"name\": \"transactionDataObject\",
+        \"vendor\": \"com.ecommerce.analytics.tracking\",
+        \"version\": \"1-0-0\"
+      },
+      \"type\": \"object\"
+    }"
+    json_object: "{
+      \"asset_id\": \"minecraft:trim_pattern\",
+      \"description\": {
+          \"color\": \"#FFAA00\",
+          \"translate\": \"trim_pattern.description\"
+      },
+      \"template_item\": \"minecraft:template_item\"
+    }"
+  - json_schema: "{
+      \"$comment\": \"https://minecraft.fandom.com/wiki/Data_Pack\",
+      \"$id\": \"https://json.schemastore.org/minecraft-damage-type.json\",
+      \"$schema\": \"http://json-schema.org/draft-07/schema#\",
+      \"description\": \"A damage type's for a Minecraft data pack config schema\",
+      \"properties\": {
+          \"death_message_type\": {
+              \"enum\": [
+                  \"default\",
+                  \"fall_variants\",
+                  \"intentional_game_design\"
+              ],
+              \"type\": \"string\"
+          },
+          \"effects\": {
+              \"enum\": [
+                  \"hurt\",
+                  \"thorns\",
+                  \"drowning\",
+                  \"burning\",
+                  \"poking\",
+                  \"freezing\"
+              ],
+              \"type\": \"string\"
+          },
+          \"exhaustion\": {
+              \"type\": \"number\"
+          },
+          \"message_id\": {
+              \"type\": \"string\"
+          },
+          \"scaling\": {
+              \"enum\": [
+                  \"never\",
+                  \"always\",
+                  \"when_caused_by_living_non_player\"
+              ],
+              \"type\": \"string\"
+          }
+      },
+      \"required\": [
+          \"message_id\",
+          \"scaling\",
+          \"exhaustion\"
+      ],
+      \"title\": \"Minecraft Data Pack Damage Type\",
+      \"type\": \"object\"
+    }"
+    json_object: "{
+      \"message_id\": \"minecraft:damage.message\",
+      \"scaling\": \"always\",
+      \"exhaustion\": 0.3,
+      \"death_message_type\": \"default\",
+      \"effects\": \"hurt\"
+    }"
+  - json_schema: "{
+      \"additionalProperties\": false,
+      \"description\": \"Schema for tracking e-commerce transaction details and metadata.\",
+      \"properties\": {
+        \"store\": {
+          \"description\": \"The store or seller associated with the transaction.\",
+          \"maxLength\": 500,
+          \"type\": \"string\"
+        },
+        \"discountCode\": {
+          \"description\": \"Promotional code applied to the transaction.\",
+          \"maxLength\": 500,
+          \"type\": \"string\"
+        },
+        \"currencyCode\": {
+          \"description\": \"ISO 4217 currency code for the transaction.\",
+          \"maxLength\": 3,
+          \"minLength\": 3,
+          \"type\": \"string\"
+        },
+        \"transactionId\": {
+          \"description\": \"Unique identifier for the transaction.\",
+          \"maxLength\": 500,
+          \"type\": \"string\"
+        },
+        \"productList\": {
+          \"description\": \"Identifier for the product list from which the purchase was made.\",
+          \"maxLength\": 500,
+          \"type\": \"string\"
+        },
+        \"purchaseOption\": {
+          \"description\": \"Additional purchase options or preferences.\",
+          \"maxLength\": 500,
+          \"type\": \"string\"
+        },
+        \"totalAmount\": {
+          \"description\": \"Total revenue generated from the transaction.\",
+          \"multipleOf\": 0.01,
+          \"type\": \"number\"
+        },
+        \"deliveryCharge\": {
+          \"description\": \"Shipping cost associated with the order.\",
+          \"multipleOf\": 0.01,
+          \"type\": \"number\"
+        },
+        \"processStep\": {
+          \"description\": \"Current step in the purchase or checkout process.\",
+          \"maximum\": 2147483647,
+          \"minimum\": 0,
+          \"type\": \"integer\"
+        },
+        \"taxAmount\": {
+          \"description\": \"Total tax applied to the transaction.\",
+          \"multipleOf\": 0.01,
+          \"type\": \"number\"
+        }
+      },
+      \"self\": {
+        \"format\": \"jsonschema\",
+        \"name\": \"transactionDataObject\",
+        \"vendor\": \"com.ecommerce.analytics.tracking\",
+        \"version\": \"1-0-0\"
+      },
+      \"type\": \"object\"
+    }"
+    json_object: "{
+      \"store\": \"TechGadgets Online\",
+      \"discountCode\": \"SUMMER20\",
+      \"currencyCode\": \"USD\",
+      \"transactionId\": \"TXN123456789\",
+      \"productList\": \"Best Sellers\",
+      \"purchaseOption\": \"Express Shipping\",
+      \"totalAmount\": 299.99,
+      \"deliveryCharge\": 5.99,
+      \"processStep\": 3,
+      \"taxAmount\": 20.50
+    }"
+  - json_schema: "{
+      \"properties\": {
+          \"date\": {
+              \"description\": \"The date of the meeting\",
+              \"type\": \"string\"
+          },
+          \"time\": {
+              \"description\": \"The time of the meeting\",
+              \"type\": \"string\"
+          },
+          \"participants\": {
+              \"description\": \"List of participants' emails\",
+              \"type\": \"array\",
+              \"items\": {
+                  \"type\": \"string\"
+              }
+          }
+      },
+      \"required\": [
+          \"date\",
+          \"time\"
+      ],
+      \"type\": \"object\"
+    }"
+    json_object: "{
+      \"date\": \"2024-09-30\",
+      \"time\": \"10:00 AM\",
+      \"participants\": [
+          \"alice@example.com\",
+          \"bob@example.com\"
+      ]
+    }"
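The `doc_to_target` template in this config selects a different target depending on whether a document carries a `json_object` field: few-shot samples show the example object, while a test document falls back to the schema, which is what `metrics.py` later receives as `references[0]`. A minimal Python analogue of that conditional is sketched below; the function and the two sample documents are hypothetical illustrations, not harness code, and they assume test rows have no `json_object` field, as the YAML comment suggests.

```python
def doc_to_target(doc: dict) -> str:
    """Python analogue of '{{json_object if json_object is defined else json_schema}}'."""
    # Use the example object when present, otherwise fall back to the schema.
    return doc["json_object"] if "json_object" in doc else doc["json_schema"]


# Hypothetical few-shot document: has both a schema and an example object.
fewshot_doc = {
    "json_schema": '{"type": "object"}',
    "json_object": '{"city": "Seattle"}',
}
# Hypothetical test document: only the schema is available.
test_doc = {"json_schema": '{"type": "object"}'}

print(doc_to_target(fewshot_doc))  # the example object
print(doc_to_target(test_doc))     # the schema itself
```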
--- a/lm_eval/tasks/jsonschema_bench/jsonschema_bench_hard.yaml
+++ b/lm_eval/tasks/jsonschema_bench/jsonschema_bench_hard.yaml
+include: jsonschema_bench_easy.yaml
+task: jsonschema_bench_hard
+dataset_name: Github_hard
--- a/lm_eval/tasks/jsonschema_bench/jsonschema_bench_medium.yaml
+++ b/lm_eval/tasks/jsonschema_bench/jsonschema_bench_medium.yaml
+include: jsonschema_bench_easy.yaml
+task: jsonschema_bench_medium
+dataset_name: Github_medium
--- a/lm_eval/tasks/jsonschema_bench/metrics.py
+++ b/lm_eval/tasks/jsonschema_bench/metrics.py
+import ipaddress
+import json
+import logging
+import uuid
+from typing import Any, Dict
+# check if jsonschema is installed
+try:
+    import jsonschema
+    from jsonschema import Draft202012Validator, FormatChecker, ValidationError
+except ImportError as e:
+    raise ImportError(
+        "jsonschema is not installed. Please install it using 'pip install jsonschema[format]'"
+    ) from e
+eval_logger = logging.getLogger(__name__)
+def is_json_schema_valid(schema: dict):
+    """
+    Check if a JSON schema is valid.
+    :param schema: A JSON schema.
+    :return: True if the schema is valid, False otherwise.
+    """
+    try:
+        # Check if the schema is valid
+        jsonschema.Draft202012Validator.check_schema(schema)
+        return True
+    except jsonschema.SchemaError:
+        return False
+# Initialize the FormatChecker
+format_checker = FormatChecker()
+# Add custom format checkers
+@format_checker.checks("ipv4")
+def ipv4_check(value):
+    ipaddress.IPv4Address(value)
+@format_checker.checks("ipv6")
+def ipv6_check(value):
+    ipaddress.IPv6Address(value)
+@format_checker.checks("uuid")
+def uuid_check(value):
+    uuid.UUID(value)
+def schema_conform_with_format_checker(
+    instance: Dict[str, Any], schema: Dict[str, Any]
+) -> bool:
+    """
+    Validate a JSON instance against a schema with enhanced format checking.
+    :param schema: The JSON schema to validate against.
+    :param instance: The JSON instance to validate.
+    :raises ValidationError: If the validation fails.
+    """
+    # first check if the schema is valid
+    if not is_json_schema_valid(schema):
+        raise ValidationError("The JSON schema is invalid.")
+    validator = Draft202012Validator(schema, format_checker=format_checker)
+    try:
+        validator.validate(instance)
+    except ValidationError as e:
+        raise ValidationError(e.message)
+    return True
+def schema_compliance(references: list[str], predictions: list[str]) -> bool:
+    assert len(references) == 1, (
+        "We only have one reference for this task, which is the JSON schema."
+    )
+    assert len(predictions) == 1, (
+        "Currently, we don't support pass@k for JSON schema validation."
+    )
+    reference = references[0]
+    prediction = predictions[0]  # single prediction per document (no pass@k)
+    json_schema = json.loads(reference.strip())
+    try:
+        json_obj = json.loads(prediction.strip().strip("```").strip("json"))
+    except json.JSONDecodeError:
+        return False
+    try:
+        schema_conform = schema_conform_with_format_checker(json_obj, json_schema)
+    except Exception as e:
+        eval_logger.error(f"Error: {e}")
+        return False
+    return schema_conform
+def json_validity(references: list[str], predictions: list[str]) -> bool:
+    assert len(predictions) == 1, (
+        "Currently, we don't support pass@k for JSON schema validation."
+    )
+    prediction = predictions[0]  # single prediction per document (no pass@k)
+    try:
+        json.loads(prediction.strip().strip("```").strip("json").strip())
+    except json.JSONDecodeError:
+        return False
+    return True
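A note on the fence stripping in `schema_compliance` and `json_validity`: `str.strip(chars)` removes any of the listed characters from both ends rather than a literal prefix or suffix. That works for fenced JSON objects and arrays, but a bare literal such as `null` loses its leading `n`. The sketch below shows both cases and one possible alternative using `str.removeprefix` / `str.removesuffix` (Python 3.9+); `strip_code_fence` is a hypothetical helper, not part of this commit.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker, built here to keep this example fence-safe


def strip_code_fence(text: str) -> str:
    """Hypothetical helper: drop one leading fence (with optional 'json' tag) and one trailing fence."""
    text = text.strip()
    text = text.removeprefix(FENCE + "json").removeprefix(FENCE)
    return text.removesuffix(FENCE).strip()


fenced = FENCE + 'json\n{"a": 1}\n' + FENCE

# The chained-strip form used in json_validity handles the fenced object.
chained = fenced.strip().strip(FENCE).strip("json").strip()
print(json.loads(chained))  # {'a': 1}

# strip("json") treats its argument as a character set, so "null" becomes "ull".
print("null".strip("json"))  # ull

# Prefix/suffix removal leaves a bare JSON literal untouched.
print(strip_code_fence("null"))  # null
print(json.loads(strip_code_fence(fenced)))  # {'a': 1}
```

Since the benchmark schemas mostly describe objects, the character-set behavior rarely matters in practice; the alternative is only relevant if top-level literals are expected.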