"vscode:/vscode.git/clone" did not exist on "51fd114d8b8eed19226870ee7fd12dba1e25d550"
  1. 23 Oct, 2024 2 commits
  2. 17 Oct, 2024 1 commit
  3. 14 Oct, 2024 2 commits
  4. 10 Oct, 2024 1 commit
  5. 04 Oct, 2024 1 commit
    • Add basic FP8 KV cache support (#2603) · 2358c2bb
      Daniël de Kok authored
      * Add basic FP8 KV cache support
      
      This change adds rudimentary FP8 KV cache support, enabled by passing
      `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so uses this type for
      the KV cache. However, support is still limited:
      
      * Only the `fp8_e5m2` type is supported.
      * The KV cache layout is the same as `float16`/`bfloat16` (HND).
      * The FP8 KV cache is only supported for FlashInfer.
      * Loading of scales is not yet supported.
      
      * Fix Cargo.toml
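      
      As a quick illustration of the `fp8_e5m2` trade-off (5 exponent bits and
      2 mantissa bits: the range of `float16`, but coarse precision), here is a
      minimal PyTorch sketch. It is not part of this change and assumes
      PyTorch >= 2.1, which ships the `torch.float8_e5m2` dtype:
      
      ```python
      import torch
      
      # fp8_e5m2 keeps the float16 exponent width but only 2 mantissa bits,
      # so a round-trip shows visible quantization error.
      x = torch.tensor([0.1234, 3.1416, 100.0], dtype=torch.float16)
      q = x.to(torch.float8_e5m2)   # quantize, as an fp8_e5m2 KV cache would
      print(q.to(torch.float16))    # roughly tensor([ 0.1250,  3.0000, 96.0000])
      ```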
  6. 03 Oct, 2024 2 commits
  7. 02 Oct, 2024 3 commits
    • Unroll notify error into generate response (#2597) · d22b0c1f
      drbh authored
      * feat: unroll notify_error if no tool is chosen (see the request sketch below)
      
      * fix: expect simple message when no tool is selected
      
      * fix: improve test to avoid notify_error
      
      * fix: improve docs and indicate change in expected response
      
      * fix: adjust linting in test file
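      
      After this change, a chat request that offers tools but where the model
      selects none comes back as a plain assistant message rather than a
      `notify_error` tool call. A hedged sketch against a local TGI instance
      (the URL and tool definition are illustrative assumptions):
      
      ```python
      import requests
      
      resp = requests.post(
          "http://localhost:8080/v1/chat/completions",
          json={
              "model": "tgi",
              "messages": [{"role": "user", "content": "What is deep learning?"}],
              # A tool is offered, but the question does not call for it.
              "tools": [{
                  "type": "function",
                  "function": {
                      "name": "get_weather",
                      "description": "Get the current weather for a city",
                      "parameters": {
                          "type": "object",
                          "properties": {"city": {"type": "string"}},
                          "required": ["city"],
                      },
                  },
              }],
              "tool_choice": "auto",
          },
      )
      message = resp.json()["choices"][0]["message"]
      print(message["content"])  # expected: a simple text answer, no tool call
      ```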
    • CI (2592): Allow LoRA adapter revision in server launcher (#2602) · 23354595
      drbh authored
      Allow revision for LoRA adapters from the launcher
      Co-authored-by: Sida <sida@kulamind.com>
      Co-authored-by: teamclouday <teamclouday@gmail.com>
    • Mllama flash version (#2585) · d18ed5cf
      Nicolas Patry authored
      * Working loading state.
      
      * Preprocessing.
      
      * Working state? (Broke idefics1 temporarily).
      
      * Cleaner condition.
      
      * Fix idefics.
      
      * Updating config, removing TODO
      
      * Mllama
      
      * Upgrade transformers 4.45
      
      * Flashing mllama.
      
      * Starting to get there.
      
      * Working state.
      
      * Integration tests for mllama (cutting to 10 tokens because there seems
      to be instability afterwards, meaning the size of the batch matters).
      
      * Updating model link.
      
      * Earlier assert.
      
      * Fix vlm ?
      
      * remove log.
      
      * Force ignore all images but last.
      
      * Default dtype bfloat16.
      
      * Update integration test after switch to bf16.
      
      * Remove dead code.
      
      * Removed dead code.
      
      * Upgrade the flake to latest transformers/tokenizers
      
      * Move to hf tgi-nix
      
      * Upgrade to 0.5.0
  8. 30 Sep, 2024 3 commits
    • feat: support phi3.5 moe (#2479) · 93a7042d
      drbh authored
      * feat: support phi3.5 moe model loading
      
      * fix: prefer llama base model and improve rotary logic
      
      * feat: return reasonable generation and add integration test
      
      * fix: run lint and update docs
      
      * fix: rerun lint for openapi docs
      
      * fix: prefer do_sample false unless temp is set by user, and update chat tests
      
      * fix: small typo adjustments
      
      * fix: consolidate long rope paths
      
      * fix: revert greedy by default and test changes
      
      * Vendor configuration so that we don't have to `trust_remote_code`
      
      * Use SparseMoELayer
      
      * Add support for dense MoE (see the sketch below)
      
      * Some type annotations
      
      * Add the usual model tests
      
      * Ruff.
      
      ---------
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
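      
      On the dense-MoE bullet above: in a dense MoE every expert runs on every
      token and the outputs are mixed by the router weights, while a sparse MoE
      only runs the top-k experts per token. A generic PyTorch sketch of the
      dense case (illustrative only, not TGI's `SparseMoELayer` code):
      
      ```python
      import torch
      from torch import nn
      
      def dense_moe(x, experts, router_logits):
          # Dense MoE: run every expert on every token, mix by router weight.
          weights = torch.softmax(router_logits, dim=-1)      # (tokens, n_experts)
          outs = torch.stack([e(x) for e in experts], dim=1)  # (tokens, n_experts, d)
          return (weights.unsqueeze(-1) * outs).sum(dim=1)    # (tokens, d)
      
      experts = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])
      router = nn.Linear(16, 4)
      x = torch.randn(3, 16)
      y = dense_moe(x, experts, router(x))  # shape (3, 16)
      ```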
    • Update ROCM libs and improvements (#2579) · f9e561ec
      Mohit Sharma authored
      * style
      
      * update torch
      
      * fix issues
      
      * fix clone
      
      * revert mkl
      
      * added custom PA
      
      * style
      
      * fix style
      
      * style
      
      * hide env var
      
      * fix mixtral model
      
      * add skinny kernel and merge fixes
      
      * fixed style
      
      * fix issue for sliding window models
      
      * addressed review comments
      
      * fix import
      
      * improved error message
      
      * updated default value
      
      * remove import
      
      * fix imports after rebase
      
      * float16 dep
      
      * improve dockerfile
      
      * cleaned dockerfile
    • Update architecture.md (#2577) · e790cfc0
      Ikram Ul Haq authored
  9. 24 Sep, 2024 2 commits
  10. 20 Sep, 2024 1 commit
  11. 19 Sep, 2024 1 commit
  12. 06 Sep, 2024 1 commit
  13. 05 Sep, 2024 1 commit
  14. 29 Aug, 2024 1 commit
  15. 28 Aug, 2024 1 commit
  16. 16 Aug, 2024 3 commits
  17. 09 Aug, 2024 2 commits
  18. 08 Aug, 2024 2 commits
  19. 05 Aug, 2024 1 commit
  20. 31 Jul, 2024 1 commit
  21. 29 Jul, 2024 1 commit
    • Run ci api key (#2315) · 583d37a2
      Erik Kaunismäki authored
      * Add API key auth and conditionally require authorisation for non-info/health endpoints (client sketch below).
      
      * change name to info routes
      
      * Fix comment
      
      * convert strings to lowercase for case-insensitive comparison
      
      * convert header to string
      
      * fixes and update docs
      
      * update docs again
      
      * revert wrong update
      
      ---------
      Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>
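      
      With an API key configured on the server, clients must send it on every
      route except the info/health ones. A minimal client sketch (the local URL
      and key are assumptions; `/generate` and `/health` are TGI's usual routes):
      
      ```python
      import requests
      
      API_KEY = "my-secret-key"  # whatever key the server was launched with
      
      # Protected route: requires the key.
      r = requests.post(
          "http://localhost:8080/generate",
          headers={"Authorization": f"Bearer {API_KEY}"},
          json={"inputs": "What is deep learning?",
                "parameters": {"max_new_tokens": 20}},
      )
      print(r.status_code)  # 200 with a valid key, 401 otherwise
      
      # Info/health routes stay open without a key, per this change.
      print(requests.get("http://localhost:8080/health").status_code)
      ```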
  22. 23 Jul, 2024 1 commit
  23. 19 Jul, 2024 3 commits
    • Add support for Deepseek V2 (#2224) · e52be9bb
      Daniël de Kok authored
      Deepseek V2 is a MoE model from Deepseek. Relevant differences
      compared to other models:
      
      - Grouped top-K in expert selection (see the sketch after this list).
      - mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
        configuration options.
      - `mscale_all_dim` is also used in scaling attention softmax.
      - Permuting of the query/key representations before applying rotary
        embeddings.
      - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`),
        so we need weight loading that supports quantized weights. To this
        end, `{Weights,WeightLoader}.get_weight` was added.
      - The query/key head dimensionality differs from that of the value,
        so we need to pad during attention.
      - Heads of size 192 need an extension to our paged attention
        fork, and we need to ensure that the KV cache is allocated with the
        correct size.
      - Shared experts.
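      
      To make the first bullet concrete, here is a generic PyTorch sketch of
      grouped top-K expert selection (a sketch of the technique, not TGI's
      implementation; shapes and names are illustrative):
      
      ```python
      import torch
      
      def grouped_topk(scores, n_groups, topk_groups, topk):
          # scores: (tokens, n_experts), experts split into n_groups groups.
          tokens, n_experts = scores.shape
          per_group = n_experts // n_groups
          # Score each group by its best expert, keep the best groups...
          group_scores = scores.view(tokens, n_groups, per_group).max(dim=-1).values
          group_idx = group_scores.topk(topk_groups, dim=-1).indices
          keep = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)
          keep = keep.unsqueeze(-1).expand(-1, -1, per_group).reshape(tokens, n_experts)
          # ...then take the top-k experts within the selected groups only.
          masked = scores.masked_fill(keep == 0, float("-inf"))
          return masked.topk(topk, dim=-1)
      
      scores = torch.randn(2, 8)  # 2 tokens, 8 experts in 4 groups of 2
      values, indices = grouped_topk(scores, n_groups=4, topk_groups=2, topk=2)
      ```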
    • add usage stats to toctree (#2260) · 40f5dc3e
      Erik Kaunismäki authored
      quick fix
    • usage stats and crash reports (#2220) · 4c19593a
      Erik Kaunismäki authored
      * draft of usage stats
      
      * fix wrong link
      
      * launcher doesn't need sysinfo dep
      
      * only tokenizer class instead of whole struct
      
      * unused import
      
      * fix clippy errors
      
      * update openAPI doc
      
      * cargo fmt
      
      * fix error in passing flags to router
      
      * try again to update docs
      
      * run pre-commit locally
      
      * Update router/src/main.rs
      Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
      
      * Update router/src/main.rs
      Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
      
      * on crash use anonymous error event
      
      * delete json_output and ngrok
      
      * more robust way of checking if running in a container (see the sketch below)
      
      * more robust nvidia-smi
      
      * parse xpu more robustly
      
      * fix errors
      
      * add nvidia-smi details in docs
      
      * cargo fmt
      
      * fix clippy
      
      * should make docs check pass
      
      * Update router/src/usage_stats.rs
      Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
      
      * error reason can't be in nested json
      
      * cargo fmt
      
      ---------
      Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
      Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>
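      
      On the container-check bullet above: a common heuristic is to probe
      Docker/cgroup markers. A hedged Python sketch of that idea (the router
      does this in Rust; the paths below are the usual convention, not taken
      from this PR):
      
      ```python
      import os
      
      def probably_in_container() -> bool:
          # Docker and Podman drop marker files into the root filesystem.
          if os.path.exists("/.dockerenv") or os.path.exists("/run/.containerenv"):
              return True
          # Fall back to container hints in PID 1's cgroup memberships.
          try:
              with open("/proc/1/cgroup") as f:
                  return any("docker" in line or "kubepods" in line for line in f)
          except OSError:
              return False
      
      print(probably_in_container())
      ```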
  24. 09 Jul, 2024 2 commits
  25. 08 Jul, 2024 1 commit