Unverified Commit aaea212d authored by Martin Iglesias Goyanes, committed by GitHub

Add links to Adyen blogpost (#2500)



* Add links to Adyen blogpost

* Adding to toctree.

* Update external.md

* Update _toctree.yml

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
parent a3c9c62d
@@ -189,7 +189,7 @@ overridden with the `--otlp-service-name` argument
![TGI architecture](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/TGI.png)
-Detailed blogpost by Adyen on TGI inner workings: [LLM inference at scale with TGI](https://www.adyen.com/knowledge-hub/llm-inference-at-scale-with-tgi)
+Detailed blogpost by Adyen on TGI inner workings: [LLM inference at scale with TGI (Martin Iglesias Goyanes - Adyen, 2024)](https://www.adyen.com/knowledge-hub/llm-inference-at-scale-with-tgi)
### Local install
@@ -71,6 +71,8 @@
    title: How Guidance Works (via outlines)
  - local: conceptual/lora
    title: LoRA (Low-Rank Adaptation)
+  - local: conceptual/external
+    title: External Resources
  title: Conceptual Guides
# External Resources
- Adyen wrote a detailed article about the interplay between TGI's main components: router and server.
[LLM inference at scale with TGI (Martin Iglesias Goyanes - Adyen, 2024)](https://www.adyen.com/knowledge-hub/llm-inference-at-scale-with-tgi)
@@ -155,7 +155,3 @@ SSEs are different than:
* Webhooks: where there is a bi-directional connection. The server can send information to the client, but the client can also send data to the server after the first request. Webhooks are more complex to operate as they don’t only use HTTP.
If there are too many requests at the same time, TGI returns an HTTP Error with an `overloaded` error type (`huggingface_hub` returns `OverloadedError`). This allows the client to manage the overloaded server (e.g., it could display a busy error to the user or retry with a new request). To configure the maximum number of concurrent requests, you can specify `--max_concurrent_requests`, allowing clients to handle backpressure.
-## External sources
-Adyen wrote a nice recap of how TGI streaming feature works. [LLM inference at scale with TGI](https://www.adyen.com/knowledge-hub/llm-inference-at-scale-with-tgi)
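The streaming doc excerpted above describes how a client should react when TGI reports it is overloaded. Below is a minimal sketch of that pattern, assuming a recent `huggingface_hub` release (one where `OverloadedError` is importable from `huggingface_hub.errors`) and a TGI server listening on `http://localhost:8080`; the endpoint, retry policy, and the `generate_with_retry` helper are illustrative, not part of the committed docs.

```python
# Illustrative sketch (not part of this commit): stream tokens from a TGI
# server over SSE and back off when the server reports it is overloaded.
import time

from huggingface_hub import InferenceClient
from huggingface_hub.errors import OverloadedError  # available in recent huggingface_hub versions

client = InferenceClient("http://localhost:8080")  # hypothetical local TGI endpoint


def generate_with_retry(prompt: str, retries: int = 3) -> None:
    for attempt in range(retries):
        try:
            # stream=True consumes the server-sent events (SSE) and yields
            # tokens as soon as the server produces them.
            for token in client.text_generation(prompt, max_new_tokens=64, stream=True):
                print(token, end="", flush=True)
            print()
            return
        except OverloadedError:
            # Too many concurrent requests (bounded by the launcher's
            # maximum-concurrent-requests setting): wait and retry.
            time.sleep(2 ** attempt)
    print("Server still overloaded, giving up.")


generate_with_retry("What is Deep Learning?")
```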