Commits · dcaabcf7bcb0ea78ed31797208dbe9238bb27963 · OpenDAS / vllm_cscc

09 Aug, 2024 1 commit
- Add lmslim GPTQ support · dcaabcf7
  gaoqiong authored Aug 09, 2024
  
06 Aug, 2024 1 commit
- update version and skip install gptq_kernels · 58fb0c33
  zhuwenwen authored Aug 06, 2024
  
04 Aug, 2024 1 commit
- add llama model awq support · 5f5ddc3d
  gaoqiong authored Aug 04, 2024
  
26 Apr, 2024 1 commit
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
  
23 Apr, 2024 1 commit
- AQLM CUDA support (#3287) · 2b7949c1
  James Fleming authored Apr 23, 2024
```
Co-authored-by: mgoin <michael@neuralmagic.com>
```
11 Apr, 2024 2 commits
- [Core] Set `linear_weights` directly on the layer (#3977) · a10d3056
  Antoni Baum authored Apr 11, 2024
  
- [Misc] Add indirection layer for custom ops (#3913) · e9da5a40
  Kunshang Ji authored Apr 11, 2024
  
11 Mar, 2024 1 commit
- Re-enable the 80 char line width limit (#3305) · 2f8844ba
  Zhuohan Li authored Mar 10, 2024
  
12 Feb, 2024 1 commit
- Refactor 2 awq gemm kernels into m16nXk32 (#2723) · 56383649
  Rex authored Feb 12, 2024
```
Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net>
```
01 Feb, 2024 1 commit
- Remove hardcoded `device="cuda"` to support more devices (#2503) · 96b6f475
  Kunshang Ji authored Feb 02, 2024
```
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
```
27 Jan, 2024 1 commit
- AWQ: Up to 2.66x higher throughput (#2566) · beb89f68
  Casper authored Jan 27, 2024
  
15 Dec, 2023 1 commit
- Add GPTQ support (#916) · 0fbfc4b8
  CHU Tianxiang authored Dec 15, 2023
  
24 Nov, 2023 1 commit
- [Build] Avoid building too many extensions (#1624) · e0c6f556
  Yanming W authored Nov 23, 2023
  
20 Nov, 2023 1 commit
- Migrate linter from `pylint` to `ruff` (#1665) · 5ffc0d13
  Simon Mo authored Nov 20, 2023
  
19 Nov, 2023 1 commit
- Add AWQ support for all models (#1714) · 8d17774f
  Woosuk Kwon authored Nov 18, 2023
  
16 Nov, 2023 1 commit
- TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622) · 7076fa1c
  Zhuohan Li authored Nov 15, 2023

Refactor the tensor parallelism, quantization, and weight-loading code.

Summary of the new features enabled by this PR:
- **All models** can be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- The model-loading code is now much simpler.
- Model parallelism is supported for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
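
The last point can be pictured with a short sketch. This is not code from the PR: the helper names `kv_heads_per_rank` and `kv_heads_for_rank`, the divisibility checks, and the rule that each key/value head is copied onto `tp_size // total_num_kv_heads` ranks are assumptions made purely for illustration.

```python
def kv_heads_per_rank(total_num_kv_heads: int, tp_size: int) -> tuple[int, int]:
    """Return (kv_heads_on_each_rank, replicas_of_each_kv_head).

    Hypothetical helper for illustration only; not taken from the PR.
    """
    if total_num_kv_heads <= 0 or tp_size <= 0:
        raise ValueError("head count and tensor parallel size must be positive")
    if total_num_kv_heads >= tp_size:
        # Enough heads to give every rank its own disjoint slice.
        if total_num_kv_heads % tp_size != 0:
            raise ValueError("total_num_kv_heads must be divisible by tp_size")
        return total_num_kv_heads // tp_size, 1
    # Fewer heads than ranks: assume each head is replicated across
    # several ranks so that every rank still holds exactly one head.
    if tp_size % total_num_kv_heads != 0:
        raise ValueError("tp_size must be divisible by total_num_kv_heads")
    return 1, tp_size // total_num_kv_heads


def kv_heads_for_rank(rank: int, total_num_kv_heads: int, tp_size: int) -> range:
    """Return the indices of the key/value heads held by one rank."""
    per_rank, replicas = kv_heads_per_rank(total_num_kv_heads, tp_size)
    # Consecutive groups of `replicas` ranks share the same head slice.
    start = (rank // replicas) * per_rank
    return range(start, start + per_rank)
```

Under these assumptions, a model with 2 key/value heads and a tensor parallel size of 8 keeps one head on each rank and four copies of each head, while an MQA model (a single key/value head) places the same head on every rank.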
