Unverified Commit 3efa5bbb authored by Nick Hill's avatar Nick Hill Committed by GitHub

fix(router): Include special tokens when tokenizing (#14)

There's currently a discrepancy in tokenization between the router
and the Python server code: the latter includes special tokens but the
former does not.

This results in a token count mismatch for seq2seq models such as mt0
where the tokenizer emits an EOS token at the end.

This in turn results in some unexpected/incorrect output, in particular
when batch concatenation is involved, because the python code uses the
input length passed from the router for each row.

As far as I can tell, it is better to include this token in the encoder
`input_ids`, so it seems best to just adjust this on the router side.
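To illustrate the off-by-one described above, here is a minimal toy sketch (not the actual router or server code; the tokenizer, vocabulary, and `EOS_ID` are hypothetical) showing how the `add_special_tokens` flag changes the reported input length for a tokenizer that appends EOS:

```python
EOS_ID = 1  # hypothetical end-of-sequence token id

def encode(text, add_special_tokens):
    # Toy whitespace "tokenizer" for illustration only; real tokenizers
    # (e.g. the HF tokenizers library) take a similar boolean flag.
    ids = [hash(w) % 1000 + 2 for w in text.split()]
    if add_special_tokens:
        # seq2seq tokenizers such as mt0's append EOS here
        ids.append(EOS_ID)
    return ids

without_special = encode("translate this sentence", False)
with_special = encode("translate this sentence", True)

# The one-token difference is exactly the router/server mismatch:
# the router counted len(without_special) while the server saw
# len(with_special) tokens.
assert len(with_special) == len(without_special) + 1
```

Passing `true` for the flag on the router side makes both counts agree.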
parent 686cc667
```diff
@@ -131,7 +131,7 @@ fn validation_worker(
         }
         // Get the number of tokens in the input
-        match tokenizer.encode(request.inputs.clone(), false) {
+        match tokenizer.encode(request.inputs.clone(), true) {
             Ok(inputs) => {
                 let input_length = inputs.len();
...
```