# Guidance

## What is Guidance?

Guidance is a feature that allows users to constrain the generation of a large language model with a specified grammar. This is particularly useful when you want the model to produce text that follows a specific structure, uses a specific set of words, or comes in a specific format. A prominent example is a JSON grammar, where the model is forced to output valid JSON.

## How is it used?

Guidance can be implemented in many ways, and the community is always finding new ways to use it.

Technically, guidance can be used to generate:

- a specific JSON object
- a function signature
- typed output like a list of integers
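For instance, "typed output like a list of integers" can be expressed as a JSON schema. The `{"type": "json", "value": ...}` wrapper below follows the shape of TGI's `grammar` parameter, but treat it as an illustrative sketch rather than the definitive API:

```python
import json

# A JSON-schema grammar describing typed output: a list of integers.
schema = {
    "type": "object",
    "properties": {
        "numbers": {
            "type": "array",
            "items": {"type": "integer"},
        }
    },
    "required": ["numbers"],
}

# Hypothetical grammar object in the shape TGI's `grammar` parameter expects.
grammar = {"type": "json", "value": schema}

# A completion that satisfies this grammar would look like:
example_output = '{"numbers": [1, 2, 3]}'
parsed = json.loads(example_output)
print(all(isinstance(n, int) for n in parsed["numbers"]))  # True
```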

However, these use cases can span a wide range of applications, such as:

- extracting structured data from unstructured text
- summarizing text into a specific format
- limiting output to specific classes of words (acting as an LLM-powered classifier)
- generating the input for specific APIs or services
- providing reliable and consistent output for downstream tasks
- extracting data from multimodal inputs

## How does it work?

Diving into the details, guidance is enabled by including a grammar with a generation request; the grammar is compiled and then used to constrain which tokens the model can choose.

This process can be broken down into the following steps:

1. A request is sent to the backend, where it is processed and placed in a batch. Processing includes compiling the grammar into a finite state machine and an initial grammar state.

<div class="flex justify-center">
    <img
        class="block dark:hidden"
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/request-to-batch.gif"
    />
    <img
        class="hidden dark:block"
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/request-to-batch-dark.gif"
    />
</div>
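To make the "finite state machine" in step 1 concrete, here is a hand-written toy sketch for a hypothetical grammar of "one or more digits" — it illustrates the kind of structure a compiled grammar produces, not TGI's actual compiler output:

```python
DIGITS = set("0123456789")

# state -> {allowed input -> next state}
fsm = {
    "start": {d: "in_number" for d in DIGITS},
    "in_number": {d: "in_number" for d in DIGITS},
}
accepting = {"in_number"}

def accepts(text: str) -> bool:
    """Walk the FSM; reject as soon as an input has no transition."""
    state = "start"
    for ch in text:
        transitions = fsm[state]
        if ch not in transitions:
            return False
        state = transitions[ch]
    return state in accepting

print(accepts("123"))  # True
print(accepts("12a"))  # False
print(accepts(""))     # False
```

At any state, the set of keys in `fsm[state]` is exactly the set of inputs the grammar allows next — this is the information the later masking step relies on.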

2. The model does a forward pass over the batch. This returns probabilities for each token in the vocabulary for each request in the batch.

3. The process of choosing one of those tokens is called `sampling`. The model samples from the probability distribution to choose the next token. In TGI, all of the steps before sampling are called `processors`. Grammars are applied as a processor that masks out tokens that are not allowed by the grammar.

<div class="flex justify-center">
    <img
        class="block dark:hidden"
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/logit-grammar-mask.gif"
    />
    <img
        class="hidden dark:block"
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/logit-grammar-mask-dark.gif"
    />
</div>

4. The grammar mask is applied and the model samples from the remaining tokens. Once a token is chosen, we update the grammar state with the new token, to prepare it for the next pass.

<div class="flex justify-center">
    <img
        class="block dark:hidden"
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/sample-logits.gif"
    />
    <img
        class="hidden dark:block"
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/sample-logits-dark.gif"
    />
</div>
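Steps 3 and 4 can be sketched in miniature: mask the disallowed logits, sample (greedy argmax here, for determinism), then advance the grammar state with the chosen token. The vocabulary, logits, and grammar are all made up for illustration:

```python
import math

vocab = ["{", "}", '"', "a", "1"]

def allowed_tokens(grammar_state: str) -> set:
    # Hypothetical grammar: a "{" must come first, then anything goes.
    return {"{"} if grammar_state == "start" else set(vocab)

def masked_argmax(logits, grammar_state):
    """Mask out disallowed tokens, then pick the highest-scoring one."""
    allowed = allowed_tokens(grammar_state)
    masked = [
        logit if token in allowed else -math.inf
        for token, logit in zip(vocab, logits)
    ]
    best = max(range(len(vocab)), key=lambda i: masked[i])
    return vocab[best]

logits = [0.1, 2.5, 0.3, 1.0, 0.2]      # "}" has the highest raw score...
token = masked_argmax(logits, "start")  # ...but only "{" is allowed
print(token)  # "{"

grammar_state = "in_object"  # grammar state updated for the next pass
print(masked_argmax(logits, grammar_state))  # "}"
```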

## How to use Guidance?

There are two main ways to use guidance: you can either use the `/generate` endpoint with a grammar or use the `/chat/completions` endpoint with tools.

Under the hood, tools are a special case of grammars that allow the model to choose one or none of the provided tools.
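The two request shapes, shown as plain dicts. Field names follow TGI's documented API, but the tool itself (`get_weather`) is hypothetical — check the using-guidance tutorial before relying on the exact payload shape:

```python
# Body for POST /generate: constrain output with a JSON-schema grammar.
generate_payload = {
    "inputs": "Give me information about Paris as JSON:",
    "parameters": {
        "grammar": {
            "type": "json",
            "value": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    },
}

# Body for POST /chat/completions: let the model pick a tool (or none).
chat_payload = {
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}
```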

Please refer to [using guidance](../basic_tutorials/using_guidance) for more examples and details on how to use guidance in Python, JavaScript, and cURL.

### Getting the most out of guidance

Depending on how you are using guidance, you may want to make use of different features. Here are some tips to get the most out of guidance:

- If you are using the `/generate` endpoint with a `grammar`, it is recommended to include the grammar in the prompt, prefixed by something like `Please use the following JSON schema to generate the output:`. This helps the model understand the context of the grammar and generate the output accordingly.
- If you are getting a response with many repeated tokens, use the `frequency_penalty` or `repetition_penalty` parameters to reduce repetition in the output.
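Both tips combined in one sketch: the schema is embedded in the prompt as well as passed as the grammar, and a repetition penalty is set. The schema and penalty value are illustrative, not recommended defaults:

```python
import json

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

# Include the grammar in the prompt so the model sees it too.
prompt = (
    "Please use the following JSON schema to generate the output:\n"
    f"{json.dumps(schema)}\n"
    "Describe the person named Alice, aged 30."
)

payload = {
    "inputs": prompt,
    "parameters": {
        "grammar": {"type": "json", "value": schema},
        "repetition_penalty": 1.1,  # nudge down repeated tokens
    },
}
print("JSON schema" in payload["inputs"])  # True
```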