Commits · 90a1d04a2f560df25a2786fcc1f117a05650dd7d · OpenDAS / text-generation-inference

30 Sep, 2024 4 commits

Add support for GPTQ-quantized MoE models using MoE Marlin (#2557) · 90a1d04a

Daniël de Kok authored Sep 30, 2024

This change adds support for MoE models that use GPTQ quantization.
Currently only models with the following properties are supported:

- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
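
The restrictions listed above could be checked up front, before any weights are loaded. The following is a minimal sketch, assuming the usual GPTQ config fields (`desc_act`, `group_size`, `sym`). The `QuantConfig` class, its `quant_method` and `tensor_parallel_size` fields, and the `unsupported_reasons` function are hypothetical names used for illustration, not the API added by this commit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QuantConfig:
    """Hypothetical container for the quantization settings of a MoE checkpoint."""
    quant_method: str            # assumed values: "gptq" or "awq"
    desc_act: bool               # GPTQ activation-order flag
    group_size: int              # -1 means one group per column
    sym: bool                    # True for symmetric quantization
    tensor_parallel_size: int = 1


def unsupported_reasons(cfg: QuantConfig) -> List[str]:
    """Return the reasons a config falls outside the supported set (empty if none)."""
    reasons = []
    if cfg.quant_method == "awq":
        reasons.append("AWQ is not supported")
    if not cfg.sym:
        reasons.append("asymmetric quantization is not supported")
    # desc_act is only a problem when sharding across GPUs with grouped scales.
    if cfg.desc_act and cfg.tensor_parallel_size > 1 and cfg.group_size != -1:
        reasons.append("desc_act with tensor parallelism requires group_size=-1")
    return reasons
```

An empty list means the checkpoint falls inside the supported set described by the commit message; anything else can be turned into an error message before loading.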

Update ROCM libs and improvements (#2579) · f9e561ec

Mohit Sharma authored Sep 30, 2024

* style

* update torch

* fix issues

* fix clone

* revert mkl

* added custom PA

* style

* fix style

* style

* hide env var

* fix mixtral model

* add skinny kernel and merge fixes

* fixed style

* fix issue for sliding window models

* addressed review comments

* fix import

* improved error message

* updated default value

* remove import

* fix imports after rebase

* float16 dep

* improve dockerfile

* cleaned dockerfile

Update architecture.md (#2577) · e790cfc0
Ikram Ul Haq authored Sep 30, 2024

Remove compute capability lazy cell (#2580) · afc7ded8

Daniël de Kok authored Sep 30, 2024

Remove compute capability lock

We are only calling the `get_cuda_capability` function once, so avoiding
the cost of multiple calls is not really necessary yet.

28 Sep, 2024 1 commit
- flashinfer: pass window size and dtype (#2574) · 1028996f
  Daniël de Kok authored Sep 28, 2024
  
27 Sep, 2024 1 commit

Improve support for GPUs with capability < 8 (#2575) · 5b6b74e2

Daniël de Kok authored Sep 27, 2024

* Improve support for GPUs with capability < 8

- For models that cannot use flashinfer, use flash-attn v1 + paged
  attention on GPUs with a compute capability lower than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
  cache, since v1 cannot use block tables.

* nix: add flash-attn-v1 to the server environment

* Move disabling prefix caching into the block of exceptions

* Capability as `usize`s
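
The fallback described in this commit can be sketched as a small decision function. This is only an illustration of the rules stated in the message: `AttentionPlan`, `plan_attention`, and the string labels are hypothetical, and cases the message does not describe are deliberately left unhandled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttentionPlan:
    """Hypothetical summary of the choices the commit message describes."""
    prefill_kernel: str
    decode_kernel: str
    prefix_caching: bool
    prefill_inputs: str  # what is handed to the prefill kernel


def plan_attention(capability_major: int, can_use_flashinfer: bool) -> Optional[AttentionPlan]:
    """Return the fallback plan for older GPUs, or None for cases not covered here."""
    if not can_use_flashinfer and capability_major < 8:
        return AttentionPlan(
            prefill_kernel="flash-attn-v1",
            decode_kernel="paged-attention",
            # Prefix caching is disabled when paged attention is used.
            prefix_caching=False,
            # flash-attn v1 cannot use block tables, so it gets key/value directly.
            prefill_inputs="key/value",
        )
    return None  # other combinations are not specified by the commit message
```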

26 Sep, 2024 2 commits
- Fix build with `--features google` (#2566) · 0aa66d69
  Alvaro Bartolome authored Sep 26, 2024
```
* Fix `cargo build --features google`

* Add `cargo test --features google`
```
- Add LoRA adapters support for Gemma2 (#2567) · 0b7df771
  Alvaro Bartolome authored Sep 26, 2024
```
* Add LoRA adapters support for Gemma2

* Make `black` formatting happy
```
24 Sep, 2024 13 commits

remove LORA_ADAPTERS_PATH (#2563) · 7efcb5e0
Nicholas Broad authored Sep 24, 2024
```
specify how to call local adapters
```
More tensor cores. (#2558) · dd8691b7
Nicolas Patry authored Sep 24, 2024
```
* More tensor cores.

* Fixing the logic.

* Gemma is modified by this.
```

Cleanup Vertex + Chat (#2553) · c032280b

Nicolas Patry authored Sep 24, 2024

* Cleanup Vertex + Chat

* logprobs defaults to false.

* Parameters are optional

* Fix docs.

* Changing back this logprobs default.

* Fixup doc.

* Let's debug that.

* Not unstable.

* Updating Cargo ?

* Wat?

* Dummy change.

* Trying some other install.

* Trying something.

* Revert everything.

* Update Cargo lock.

* Fixing the pre-commit after rebase.

Hotfixing main. (#2562) · 75c8c54a
Nicolas Patry authored Sep 24, 2024

Adding note for private models in quick-tour document (#2548) · e6d29656

Aritra Roy Gosthipaty authored Sep 24, 2024



* chore: adding note for private models in quicktour doc

* Update docs/source/quicktour.md
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* Update docs/source/quicktour.md
Co-authored-by: vb <vaibhavs10@gmail.com>

* Update docs/source/quicktour.md
Co-authored-by: vb <vaibhavs10@gmail.com>

---------
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: vb <vaibhavs10@gmail.com>

Simplify crossterm imports (#2545) · 8024ded5
Orhun Parmaksız authored Sep 24, 2024

Update the link to the Ratatui organization (#2546) · 03263f5e
Orhun Parmaksız authored Sep 24, 2024

Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 (#2537) · 3f14cd14
Daniël de Kok authored Sep 24, 2024
```
This replaces the custom layers in both models.
```

Add support for scalar FP8 weight scales (#2550) · c29dc89c

Daniël de Kok authored Sep 24, 2024

* Add support for scalar FP8 weight scales

* Support LLM compressor FP8 checkpoints on H100

On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype.
However, we wouldn't pick up fp8 quantization for models quantized with
LLM compressor. This change adds enough parsing to detect if models have
FP8-quantized weights.

* Remove stray debug print
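
To illustrate what a scalar weight scale means in practice, here is a minimal sketch in plain Python. It assumes a single scalar scale applies to the whole weight matrix, while a list carries one scale per output row. `expand_scale` and `dequantize` are hypothetical helpers, not functions from this change.

```python
from typing import List, Sequence, Union

Scale = Union[float, Sequence[float]]


def expand_scale(scale: Scale, num_rows: int) -> List[float]:
    """Turn a scalar scale into one scale per row; pass per-row scales through."""
    if isinstance(scale, (int, float)):
        return [float(scale)] * num_rows
    if len(scale) != num_rows:
        raise ValueError("expected one scale per row")
    return [float(s) for s in scale]


def dequantize(q_rows: Sequence[Sequence[float]], scale: Scale) -> List[List[float]]:
    """Multiply each quantized row by its scale to recover approximate weights."""
    scales = expand_scale(scale, len(q_rows))
    return [[v * s for v in row] for row, s in zip(q_rows, scales)]
```

Expanding the scalar once lets downstream code treat both checkpoint styles the same way.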

Hotfixing main (#2556) · 0ff6ff60
Nicolas Patry authored Sep 24, 2024

Micro cleanup. (#2555) · 74d3ce10
Nicolas Patry authored Sep 24, 2024

Remove duplicated `RUN` in `Dockerfile` (#2547) · d31a6f75
Alvaro Bartolome authored Sep 24, 2024

chore: Add old V2 backend (#2551) · 10e6f292
OlivierDehaene authored Sep 24, 2024
```
* wip

* added v2
```

23 Sep, 2024 1 commit
- nix: remove unused `_server.nix` file (#2538) · 9263817c
  Daniël de Kok authored Sep 23, 2024
  
20 Sep, 2024 3 commits
- Preparing for release. (#2540) · 169178b9
  Nicolas Patry authored Sep 20, 2024
```
* Preparing for release.

* Upgrade version in docs.
```
- fix: wrap python basic logs in debug assertion in launcher (#2539) · 7e2d1887
  OlivierDehaene authored Sep 20, 2024
```
* fix: wrap python basic logs in debug assertion in launcher

* use level filters instead
```
- hotfix: ipex fails since cuda moe kernel is not supported (#2532) · f478aa77
  Wang, Yi authored Sep 20, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
19 Sep, 2024 3 commits

doc: clarify that `--quantize` is not needed for pre-quantized models (#2536) · abd24dd3
Daniël de Kok authored Sep 19, 2024

Update to moe-kernels 0.3.1 (#2535) · c1037601
Daniël de Kok authored Sep 19, 2024
```
* Update to moe-kernels 0.3.1

* Attempt to fix apt failure
```

Stream options. (#2533) · f512021e

Nicolas Patry authored Sep 19, 2024

* Stream options.

* Fetch stuff from nix integration test for easier testing.

* Adding the assert.

* Only send the usage when asked for.

* Update the docs.

* Impure test because we need network.

* develop.

* Optional usage.

* Fixes.

* Workflow
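
The "only send the usage when asked for" behaviour can be sketched as a generator over stream chunks. The field names (`stream_options`, `include_usage`, `usage`, `choices`, `delta`) follow the OpenAI-style chat streaming convention and are an assumption here, since the commit message does not spell them out. `stream_chunks` itself is a hypothetical function.

```python
from typing import Dict, Iterator, List, Optional


def stream_chunks(
    tokens: List[str],
    prompt_tokens: int,
    stream_options: Optional[Dict[str, bool]] = None,
) -> Iterator[dict]:
    """Yield one chunk per token, then a usage chunk only if the client asked for it."""
    for tok in tokens:
        yield {"choices": [{"delta": {"content": tok}}]}
    if stream_options and stream_options.get("include_usage"):
        completion_tokens = len(tokens)
        yield {
            "choices": [],
            "usage": {
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "total_tokens": prompt_tokens + completion_tokens,
            },
        }
```

Clients that do not pass the option see the same stream as before, which keeps the extra chunk opt-in.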

17 Sep, 2024 3 commits

Move to moe-kernels package and switch to common MoE layer (#2511) · ce85efa9

Daniël de Kok authored Sep 17, 2024

* Move to moe-kernels package and switch to common MoE layer

This change introduces the new `moe-kernels` package:

- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
  models.
- Port over Mixtral and Deepseek.

* Make `cargo check` pass

* Update runner
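
For readers unfamiliar with sparse MoE layers, the general technique is top-k routing: a router scores the experts for each token, only the best `top_k` experts run, and their outputs are mixed by renormalized router weights. The sketch below shows that idea in plain Python. It is not the `SparseMoELayer` implementation or the `moe-kernels` API, and all names in it are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[Sequence[float]], List[float]]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_moe(
    token: Sequence[float],
    router_logits: Sequence[float],
    experts: Sequence[Expert],
    top_k: int = 2,
) -> Tuple[List[float], List[int]]:
    """Run only the top_k experts for one token and mix their outputs."""
    probs = softmax(router_logits)
    # Indices of the top_k highest-probability experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(token)
    for i in chosen:
        weight = probs[i] / norm  # renormalize over the selected experts
        expert_out = experts[i](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out, chosen
```

Experts that are not selected are never called, which is where the compute savings come from.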

fix: metrics unbounded memory (#2528) · 86984e32
OlivierDehaene authored Sep 17, 2024

nix: pure Rust check/fmt/clippy/test (#2525) · 71e42686
Daniël de Kok authored Sep 17, 2024
```
Runs the tests in a Nix build sandbox.
```

16 Sep, 2024 2 commits

Adding a test for FD. (#2516) · 38fcafcf

Nicolas Patry authored Sep 16, 2024

* Adding a test for FD.

* Fixing flashdecoding (empty batch doesn't work).

* Fixing the invalid popping.

* Fixing radix with block_size > 1

* Last reference.

* Use an actual hash.

* Update hash for slice.len() == 1

* Update the locks.

* Increasing docker timeout.

Add tests for Mixtral (#2520) · 77746552
Daniël de Kok authored Sep 16, 2024
```
Disable by default because CI runners do not have enough GPUs.
```

13 Sep, 2024 1 commit
- Use `ratatui` not (deprecated) `tui` (#2521) · 9cca3e0b
  Alex Strick van Linschoten authored Sep 13, 2024
```
* use ratatui not archived tui

* bump ratatui all the way with options
```
12 Sep, 2024 4 commits

hotfix: enable intel ipex cpu and xpu in python3.11 (#2517) · 3ac7df2b
Wang, Yi authored Sep 12, 2024
```
enable intel ipex cpu and xpu in python3.11
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
```
fix: pass missing revision arg for lora adapter when loading multiple… (#2510) · 628334d3
drbh authored Sep 12, 2024
```
fix: pass missing revision arg for lora adapter when loading multiple adapters
```

Add nix test. (#2513) · d95c670a

Nicolas Patry authored Sep 12, 2024

* Add nix test.

* Modifying yourself means you need to rerun.

* Fixing the test + adding click (needed for pre-commit hooks).

* Try this.

* Our runner + pure test (not written)

* Remove server.

* Root user.

* Different user ?

* Add the actual test target.

* Forgot this modification.

* Add a formatter.

* Add the secrets.

* Fixed the auth token ?

* Adding the other tests.

* Missing pre-commit.

* Test requires cargo for cargo fmt.

* Update it a bit.

* Up.

* Attempting to use a cache location for the models.

* Ignore the cache for now.

nix: support Python tokenizer conversion in the router (#2515) · 94304649

Daniël de Kok authored Sep 12, 2024

Ideally we wouldn't have the router wrapper that this change adds,
but when I give PyO3 a Python interpreter with packages, it ends
up linking libpython from the Python interpreter rather than the
constructed environment and cannot pick up the Python modules as
a result.

11 Sep, 2024 2 commits

Fix truffle (#2514) · 69e3be20

Nicolas Patry authored Sep 11, 2024

* Attempting to discard the trufflehog warning.

* Attempt to fix trufflehog.

Fix tokenization yi (#2507) · dae3bf1d

Nicolas Patry authored Sep 11, 2024

* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).

* Fixing the builds ?

* Fix the gh action?

* Fixing the location ?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe ?

* List stuff.

* Monkey it up.

* have no idea at this point

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD ?

* Forcing 3.11 ?
