Unverified Commit 529f4805 authored by Jinwei, committed by GitHub

[Readme change for SGLang] fix error in readme and add OOM solutions for sglang (#2738)



* initial components to support sglang

* init of class SGLangLM

* draft for generate_until of SGLang model

* mock loglikelihood

* initial loglikelihood_tokens

* todo: fix bug of sglang engine init

* implement generation tasks and test

* support output type loglikelihood and loglikelihood_rolling (#1)

* .

* loglikelihood_rolling

* /

* support dp_size>1

* typo

* add tests and clean code

* skip tests of sglang for now

* fix OOM error of sglang pytest

* finish test for sglang

* add sglang to readme

* fix OOM of tests and clean SGLang model

* update readme

* clean pyproject and add tests for evaluator

* add accuracy tests and they passed locally

* add notes for test

* Update README.md

update readme

* pre-commit

* add OOM guideline for sglang and fix readme error

* fix typo

* fix typo

* add readme

---------
Co-authored-by: Xiaotong Jiang <xiaotong.jiang@databricks.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>
parent a87fe425
@@ -250,10 +250,16 @@ To use SGLang as the evaluation backend, please **install it in advance** via SG
SGLang's server arguments differ slightly from those of other backends; see [here](https://docs.sglang.ai/backend/server_arguments.html) for more information. We provide an example of the usage here:
```bash
lm_eval --model sglang \
-    --model_args pretrained={model_name},dp_size={data_parallel_size},tp_size={tensor_parallel_size},dtype=auto,mem-fraction-static=0.9, \
+    --model_args pretrained={model_name},dp_size={data_parallel_size},tp_size={tensor_parallel_size},dtype=auto \
--tasks gsm8k_cot \
--batch_size auto
```
> [!Tip]
> When encountering out-of-memory (OOM) errors (especially on multiple-choice tasks), try these solutions, combined in the sketch after this list:
> 1. Use a manual `batch_size` rather than `auto`.
> 2. Lower the KV cache pool's memory usage by reducing `mem_fraction_static`, e.g. add it to your model arguments: `--model_args pretrained=...,mem_fraction_static=0.7`.
> 3. Increase the tensor parallel size `tp_size` (if multiple GPUs are available).
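
For instance, a run applying all three mitigations together might look like the following sketch; the model, task, and all values here are illustrative placeholders, not recommendations from this README:

```bash
# Illustrative only: manual batch size, reduced static memory pool,
# and tensor parallelism across 4 GPUs. Adjust values for your setup.
lm_eval --model sglang \
    --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,tp_size=4,dtype=auto,mem_fraction_static=0.7 \
    --tasks mmlu \
    --batch_size 8
```

Lowering `mem_fraction_static` shrinks the memory pool SGLang reserves for model weights and the KV cache, leaving more headroom for intermediate buffers at the cost of some throughput.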
### Model APIs and Inference Servers
Our library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for the most commonly used performant local/self-hosted inference servers.