@@ -33,10 +33,9 @@ You may recognize this feature as "continuous batching" in other repos. But duri
- __On cache-hits (see below), history tokens don't need to be decoded in every round of a conversation; generation of response tokens will start instantly.__
- The batch grows or shrinks automatically to minimize unnecessary computations.
## KV Cache Manager
The [KV cache manager](https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/models/llama/LlamaCacheManager.h) of TurboMind is a memory-pool-like object that also implements an LRU eviction policy, so it can be viewed as a form of __cache of KV caches__. It works in the following way:
- All device memory required for KV caches is allocated by the manager. A fixed number of slots is pre-configured to match the memory size of the system. Each slot corresponds to the device memory required by the KV cache of a single sequence. The allocation chunk size can be configured to implement a pre-allocation or on-demand allocation policy (or something in between).
- When space for the KV cache of a new sequence is requested but no free slots are left in the pool, the least recently used sequence is evicted from the cache and its device memory is directly reused by the new sequence. However, this is not the end of the story (a simplified sketch of the slot-pool behavior follows this list).
...
...
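To make the pooling and eviction behavior above concrete, here is a minimal Python sketch of a fixed-slot pool with LRU eviction. It only illustrates the policy described in this list: the class and method names (`KVCachePool`, `acquire`) are made up and do not mirror TurboMind's actual C++ interface, and the handling of evicted sequences hinted at above is omitted.

```python
from collections import OrderedDict


class KVCachePool:
    """Illustrative fixed-slot pool that hands out per-sequence KV-cache slots
    and evicts the least recently used sequence when the pool is full."""

    def __init__(self, num_slots: int):
        # A fixed number of slots is pre-configured; each slot stands in for
        # the device memory backing one sequence's KV cache.
        self.free_slots = list(range(num_slots))
        self.active = OrderedDict()  # sequence_id -> slot, ordered by recency

    def acquire(self, sequence_id) -> int:
        """Return a slot for `sequence_id`, evicting the LRU sequence if needed."""
        if sequence_id in self.active:
            # Cache hit: the sequence keeps its slot; just refresh its recency.
            self.active.move_to_end(sequence_id)
            return self.active[sequence_id]
        if not self.free_slots:
            # No free slots left: evict the least recently used sequence and
            # reuse its memory for the new sequence. (What TurboMind does with
            # the evicted sequence afterwards is elided here.)
            _evicted_id, slot = self.active.popitem(last=False)
            self.free_slots.append(slot)
        slot = self.free_slots.pop()
        self.active[sequence_id] = slot
        return slot
```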
@@ -68,4 +67,4 @@ Apart from the features described above, there are still many minor differences th
### Supporting Hugging Face models
For historical reasons, TurboMind's weight layout is based on [the original LLaMA implementation](https://github.com/facebookresearch/llama) (differing only by a transpose). The implementation in Hugging Face Transformers uses a [different layout](https://github.com/huggingface/transformers/blob/45025d92f815675e483f32812caa28cce3a960e7/src/transformers/models/llama/convert_llama_weights_to_hf.py#L123C76-L123C76) for `W_q` and `W_k`, which is handled in [deploy.py](https://github.com/InternLM/lmdeploy/blob/ff4648a1d09e5aec74cf70efef35bfaeeac552e0/lmdeploy/serve/turbomind/deploy.py#L398).
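To illustrate the layout difference: the linked Hugging Face conversion script permutes each head's rows of `W_q` and `W_k` from the interleaved rotary layout of the original implementation into a half-split layout, and going back is the inverse permutation. The sketch below shows only that row reordering under assumed shapes; `hf_to_llama_qk` is a made-up name, not the actual code in deploy.py, and the extra transpose that TurboMind applies relative to the original layout is omitted.

```python
import torch


def hf_to_llama_qk(w: torch.Tensor, n_heads: int) -> torch.Tensor:
    """Reorder the output rows of a Hugging Face-layout W_q / W_k back to the
    original interleaved LLaMA layout (the inverse of the `permute()` helper
    in convert_llama_weights_to_hf.py). Shapes and names are illustrative."""
    out_dim, in_dim = w.shape
    head_dim = out_dim // n_heads
    # In the HF layout, rotary embeddings pair row i with row i + head_dim // 2
    # (two halves per head); the original layout pairs rows 2i and 2i + 1
    # (interleaved). Undo the half-split by re-interleaving the two halves.
    w = w.view(n_heads, 2, head_dim // 2, in_dim)
    return w.transpose(1, 2).reshape(out_dim, in_dim)
```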