Unverified Commit 5c25dd55 authored by Michael Goin, committed by GitHub

Add `local-completions` support using OpenAI interface (#1277)



* Add `local-completions` support using OpenAI interface

* Refactor oa_completion

* Address tokenizer comments and change request chunks to batch size

* Add warning message for tiktoken backend

* fix formatting

* fix whitespace

* Update README.md

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
parent 076372ee
@@ -172,7 +172,7 @@ lm_eval --model openai-completions \
     --tasks lambada_openai,hellaswag
 ```
-We also support using your own local inference server with an implemented version of the OpenAI ChatCompletions endpoint and passing trained HuggingFace artifacts and tokenizers.
+We also support using your own local inference server, as long as it mirrors the OpenAI Completions or ChatCompletions API.
 ```bash
 lm_eval --model local-chat-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1
@@ -181,7 +181,7 @@ Note that for externally hosted models, configs such as `--device` and `--batch_size`
 | API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: |
 |---|---|---|---|---|
-| OpenAI Completions | :heavy_check_mark: | `openai-completions` | up to `code-davinci-002` | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
+| OpenAI Completions | :heavy_check_mark: | `openai-completions`, `local-completions` | All OpenAI Completions API models | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
 | OpenAI ChatCompletions | :heavy_check_mark: | `openai-chat-completions`, `local-chat-completions` | [All ChatCompletions API models](https://platform.openai.com/docs/guides/gpt) | `generate_until` (no logprobs) |
 | Anthropic | :heavy_check_mark: | `anthropic` | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model) | `generate_until` (no logprobs) |
 | Textsynth | :heavy_check_mark: | `textsynth` | [All supported engines](https://textsynth.com/documentation.html#engines) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
@@ -189,9 +189,12 @@ Note that for externally hosted models, configs such as `--device` and `--batch_size`
 | [Llama.cpp](https://github.com/ggerganov/llama.cpp) (via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)) | :heavy_check_mark: | `gguf`, `ggml` | [All models supported by llama.cpp](https://github.com/ggerganov/llama.cpp) | `generate_until`, `loglikelihood` (perplexity evaluation not yet implemented) |
 | vLLM | :heavy_check_mark: | `vllm` | [Most HF Causal Language Models](https://docs.vllm.ai/en/latest/models/supported_models.html) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
 | Mamba | :heavy_check_mark: | `mamba_ssm` | [Mamba architecture Language Models via the `mamba_ssm` package](https://huggingface.co/state-spaces) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
-| Your local inference server! | :heavy_check_mark: | `local-chat-completions` (using the `openai-chat-completions` model type) | Any server address that accepts GET requests using HF models and mirrors OpenAI's ChatCompletions interface | `generate_until` |
+| Your local inference server! | :heavy_check_mark: | `local-completions` or `local-chat-completions` (using the `openai-completions` or `openai-chat-completions` model type) | Any server address that accepts GET requests using HF models and mirrors OpenAI's Completions or ChatCompletions interface | `generate_until` |

-It is on our roadmap to create task variants designed to enable models which do not serve logprobs/loglikelihoods to be compared with generation performance of open-source models.
+Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while models that are local or APIs that supply logprobs/logits can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
+
+For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).

 ### Other Frameworks
...
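For illustration, the new `local-completions` route can be exercised against any server that exposes an OpenAI-compatible `/v1/completions` endpoint. A minimal sketch, assuming a placeholder model name, host, and port; `base_url` and `tokenizer_backend` are the arguments added in this commit:

```bash
# Sketch: point the harness at a local OpenAI-compatible completions server.
# The API key is unused by local servers but must still be set to something.
export OPENAI_API_KEY=EMPTY
lm_eval --model local-completions \
    --tasks lambada_openai \
    --model_args model=facebook/opt-125m,base_url=http://localhost:8000/v1,tokenizer_backend=huggingface
```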
@@ -2,14 +2,14 @@ import copy
 import os
 from collections import defaultdict
 from importlib.util import find_spec
-from typing import List, Optional, Tuple
+from typing import List, Literal, Optional, Tuple

 from tqdm import tqdm

 from lm_eval import utils
 from lm_eval.api.model import LM
 from lm_eval.api.registry import register_model
-from lm_eval.utils import retry_on_specific_exceptions
+from lm_eval.utils import eval_logger, retry_on_specific_exceptions


 def get_result(response, ctxlen: int) -> Tuple[float, bool]:
@@ -40,7 +40,7 @@ def get_result(response, ctxlen: int) -> Tuple[float, bool]:
     return continuation_logprobs, is_greedy


-def oa_completion(**kwargs):
+def oa_completion(client, chat: bool = False, **kwargs):
     """Query OpenAI API for completion.

     Retry with back-off until they respond
@@ -64,19 +64,24 @@ def oa_completion():
         on_exception_callback=_exception_callback,
     )
     def completion():
-        return openai.completions.create(**kwargs)
+        if chat:
+            return client.chat.completions.create(**kwargs)
+        else:
+            return client.completions.create(**kwargs)

     return completion()


-@register_model("openai-completions")
+@register_model("openai-completions", "local-completions")
 class OpenaiCompletionsLM(LM):
-    REQ_CHUNK_SIZE = 20
     _DEFAULT_MAX_LENGTH = 2048

     def __init__(
         self,
         model: str,
+        base_url: Optional[str] = None,
+        tokenizer: Optional[str] = None,
+        tokenizer_backend: Literal["tiktoken", "huggingface"] = "tiktoken",
         truncate: bool = False,
         max_gen_toks: int = 256,
         batch_size: int = 1,
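With the refactor above, a single `oa_completion` helper serves both endpoints, dispatching on the new `chat` flag. A minimal usage sketch, assuming a placeholder local server and model, with `OPENAI_API_KEY` set in the environment:

```python
import openai

# Placeholder local server; the client constructor still requires an API key.
client = openai.OpenAI(base_url="http://localhost:8000/v1")

# chat=False (the default) calls /v1/completions ...
response = oa_completion(
    client=client, model="facebook/opt-125m", prompt=["Hello"], max_tokens=8
)

# ... while chat=True calls /v1/chat/completions.
response = oa_completion(
    client=client,
    chat=True,
    model="facebook/opt-125m",
    messages=[{"role": "user", "content": "Hello"}],
)
```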
@@ -101,15 +106,44 @@ class OpenaiCompletionsLM(LM):
                 please install these via `pip install lm-eval[openai]` or `pip install -e .[openai]`",
             )
         self.model = model
-        self.tokenizer = tiktoken.encoding_for_model(self.model)
-        self.vocab_size = self.tokenizer.n_vocab
+        self.base_url = base_url
+        self.tokenizer_backend = tokenizer_backend
         self.truncate = truncate
-        self.end_of_text_token_id = self.tokenizer.eot_token
+        self._batch_size = batch_size
         self._max_gen_toks = max_gen_toks
         self._max_length = max_length

+        # If we have a local model, use the HF tokenizer over tiktoken
+        if self.tokenizer_backend == "huggingface":
+            import transformers  # noqa: E401
+
+            self.tokenizer = transformers.AutoTokenizer.from_pretrained(
+                tokenizer if tokenizer else self.model
+            )
+            self.vocab_size = self.tokenizer.vocab_size
+            self.end_of_text_token_id = self.tokenizer.eos_token_id
+        elif self.tokenizer_backend == "tiktoken":
+            if self.base_url:
+                eval_logger.warning(
+                    f"Passed `base_url={self.base_url}` but using the tiktoken tokenizer backend. "
+                    "Pass `tokenizer_backend=huggingface` and provide the HF tokenizer name if your model does not use tiktoken."
+                )
+            self.tokenizer = tiktoken.encoding_for_model(self.model)
+            self.vocab_size = self.tokenizer.n_vocab
+            self.end_of_text_token_id = self.tokenizer.eot_token
+        else:
+            raise ValueError(
+                f"Expected tokenizer_backend to be one of ['tiktoken', 'huggingface'] but got {self.tokenizer_backend}"
+            )
+
         # Read from environment variable OPENAI_API_KEY
+        # (set to EMPTY when targeting a local server)
         openai.api_key = os.environ["OPENAI_API_KEY"]
+        if self.base_url:
+            self.client = openai.OpenAI(base_url=self.base_url)
+        else:
+            self.client = openai.OpenAI()

     @property
     def eot_token_id(self):
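The branch above exists because the two tokenizer backends expose different interfaces. An illustrative sketch of each; the model names are examples, not values from this commit:

```python
import tiktoken
import transformers

# tiktoken backend: OpenAI's own BPE vocabularies, looked up by model name.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(enc.encode("hello world"), enc.n_vocab, enc.eot_token)

# huggingface backend: any Hub tokenizer, suited to locally served models.
tok = transformers.AutoTokenizer.from_pretrained("facebook/opt-125m")
print(tok.encode("hello world"), tok.vocab_size, tok.eos_token_id)
```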
@@ -127,9 +161,8 @@ class OpenaiCompletionsLM(LM):
         return self._max_gen_toks

     @property
-    def batch_size(self):
-        # Isn't used because we override _loglikelihood_tokens
-        raise NotImplementedError()
+    def batch_size(self) -> int:
+        return self._batch_size

     @property
     def device(self):
@@ -186,7 +219,7 @@ class OpenaiCompletionsLM(LM):
         re_ord = utils.Reorderer(requests, _collate)

         for chunk in tqdm(
-            list(utils.chunks(re_ord.get_reordered(), self.REQ_CHUNK_SIZE)),
+            list(utils.chunks(re_ord.get_reordered(), self.batch_size)),
             disable=disable_tqdm,
         ):
             inps = []
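Requests are now grouped by the user-supplied `batch_size` rather than the fixed `REQ_CHUNK_SIZE = 20`. Conceptually the chunking behaves like this sketch (not the harness's own `utils.chunks` implementation):

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

def chunked(items: Sequence[T], n: int) -> Iterator[List[T]]:
    # Yield successive n-sized batches; the final batch may be smaller.
    for i in range(0, len(items), n):
        yield list(items[i : i + n])

# e.g. batch_size=2 over five requests -> [r0, r1], [r2, r3], [r4]
```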
@@ -203,6 +236,7 @@ class OpenaiCompletionsLM(LM):
                 ctxlens.append(ctxlen)

             response = oa_completion(
+                client=self.client,
                 model=self.model,
                 prompt=inps,
                 echo=True,
@@ -251,7 +285,7 @@ class OpenaiCompletionsLM(LM):
         # todo: more intelligent batching for heterogeneous `until`
         for chunk, request_args in tqdm(
-            list(sameuntil_chunks(re_ord.get_reordered(), self.REQ_CHUNK_SIZE))
+            list(sameuntil_chunks(re_ord.get_reordered(), self.batch_size))
         ):
             inps = []
             self._max_gen_toks = request_args.pop("max_gen_toks", self.max_gen_toks)
@@ -265,6 +299,7 @@ class OpenaiCompletionsLM(LM):
             request_args["temperature"] = request_args.get("temperature", 0)

             response = oa_completion(
+                client=self.client,
                 model=self.model,
                 prompt=inps,
                 max_tokens=self.max_gen_toks,
@@ -329,35 +364,6 @@ class OpenaiCompletionsLM(LM):
         return loglikelihoods


-def oa_chat_completion(client, **kwargs):
-    """Query OpenAI API for chat completion.
-
-    Retry with back-off until they respond
-    """
-    if not find_spec("openai") or not find_spec("tiktoken"):
-        raise Exception(
-            "attempted to use 'openai' LM type, but package `openai` or `tiktoken` are not installed. "
-            "Please install these via `pip install lm-eval[openai]` or `pip install -e .[openai]`"
-        )
-    else:
-        import openai
-
-    def _exception_callback(e: Exception, sleep_time: float) -> None:
-        import traceback
-
-        traceback.print_exc()
-
-    @retry_on_specific_exceptions(
-        on_exceptions=[openai.OpenAIError],
-        max_retries=None,  # retry forever, consider changing
-        on_exception_callback=_exception_callback,
-    )
-    def completion():
-        return client.chat.completions.create(**kwargs)
-
-    return completion()
-
-
 @register_model("openai-chat-completions", "local-chat-completions")
 class OpenaiChatCompletionsLM(LM):
     def __init__(
@@ -460,8 +466,12 @@ class OpenaiChatCompletionsLM(LM):
                     f"Expected repr(kwargs) to be of type repr(dict) but got {kwargs}"
                 )

-            response = oa_chat_completion(
-                client=self.client, messages=inps, model=self.model, **kwargs
-            )
+            response = oa_completion(
+                client=self.client,
+                chat=True,
+                messages=inps,
+                model=self.model,
+                **kwargs,
+            )

             for resp, (context, args_) in zip(response.choices, chunk):
...