Commit Graph

75 Commits

Author SHA1 Message Date
hissu-hyvarinen 5208dc7a20 [Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests running parallel jobs (#9279)
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com>
2024-11-04 11:37:46 -08:00
youkaichao cea808f325 [3/N] model runner pass the whole config to model (#9958)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-02 12:08:49 -07:00
youkaichao e893795443 [2/N] executor pass the complete config to worker/modelrunner (#9938)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2024-11-02 07:35:05 -07:00
wangshuai09 622b7ab955 [Hardware] using current_platform.seed_everything (#9785)
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-29 14:47:44 +00:00
wangshuai09 4e2d95e372 [Hardware][ROCM] using current_platform.is_rocm (#9642)
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-28 04:07:00 +00:00
Jee Jee Li a48e3ec052 [CI/Build][LoRA] Temporarily fix long context failure issue (#9579) 2024-10-22 11:32:51 +00:00
Joe Runde ef7faad1b8 🐛 Fixup more test failures from memory profiling (#9563)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-21 17:10:56 -07:00
Cody Yu d11bf435a0 [MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py (#9510) 2024-10-18 14:30:55 -07:00
Cyrus Leung 051eaf6db3 [Model] Add user-configurable task for models that support both generation and embedding (#9424) 2024-10-18 11:31:58 -07:00
Cyrus Leung 7e7eae338d [Misc] Standardize RoPE handling for Qwen2-VL (#9250) 2024-10-16 13:56:17 +08:00
Jee Jee Li 36ea79079b [Misc][LoRA] Support loading LoRA weights for target_modules in reg format (#9275) 2024-10-11 12:31:21 +00:00
Prashant Gupta 9ade8bbc8d [Model] add a bunch of supported lora modules for mixtral (#9008)
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
2024-10-04 16:24:40 +00:00
Jee Jee Li 3d49776bbb [Model][LoRA]LoRA support added for MiniCPMV2.5 (#7199) 2024-09-29 06:59:45 +00:00
Cyrus Leung 26a68d5d7e [CI/Build] Add test decorator for minimum GPU memory (#8925) 2024-09-29 02:50:51 +00:00
Nick Hill 4b377d6feb [BugFix] Fix test breakages from transformers 4.45 upgrade (#8829) 2024-09-26 16:46:43 -07:00
Jee Jee Li 9b0e3ec970 [Kernel][LoRA] Add assertion for punica sgmv kernels (#7585) 2024-09-23 18:57:42 +00:00
Aaron Pham 9d104b5beb [CI/Build] Update Ruff version (#8469)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-18 11:00:56 +00:00
Cyrus Leung 6ffa3f314c [CI/Build] Avoid CUDA initialization (#8534) 2024-09-18 10:38:11 +00:00
Woosuk Kwon 561d6f8077 [CI] Change test input in Gemma LoRA test (#8163) 2024-09-04 13:05:50 -07:00
alexeykondrat d1dec64243 [CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-09-04 11:57:54 -07:00
Alexander Matveev 9db93de20c [Core] Add multi-step support to LLMEngine (#7789) 2024-08-23 12:45:53 -07:00
jon-chuang 50b8d08dbd [Misc/Testing] Use torch.testing.assert_close (#7324) 2024-08-16 04:24:04 +00:00
Jee Jee Li 97992802f3 [CI/Build]Reduce the time consumption for LoRA tests (#7396) 2024-08-13 17:27:29 -07:00
Jee Jee Li 9118217f58 [LoRA] Relax LoRA condition (#7146) 2024-08-06 01:57:25 +00:00
Jee Jee Li 99d7cabd7b [LoRA] ReplicatedLinear support LoRA (#7081) 2024-08-02 22:40:19 -07:00
Jee Jee Li 7ecee34321 [Kernel][RFC] Refactor the punica kernel based on Triton (#5036) 2024-07-31 17:12:24 -07:00
Jiaxin Shan 42c7f66a38 [Core] Support dynamically loading Lora adapter from HuggingFace (#6234)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-22 15:42:40 -07:00
Swapnil Parekh 4d6ada947c [CORE] Adding support for insertion of soft-tuned prompts (#4645)
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-09 13:26:36 -07:00
Qubitium-ModelCloud ee93f4f92a [CORE] Quantized lm-head Framework (#4442)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
2024-07-02 22:25:17 +00:00
SangBin Cho f5e73c9f1b [Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (#5909)
Co-authored-by: sang <sangcho@anyscale.com>
2024-06-30 17:11:15 +00:00
Joe Runde ba4994443a [Kernel] Add punica dimensions for Granite 3b and 8b (#5930)
Signed-off-by: Joe Runde <joe@joerun.de>
2024-06-29 10:48:25 +08:00
rohithkrn f5dda63eb5 [LoRA] Add support for pinning lora adapters in the LRU cache (#5603) 2024-06-21 15:42:46 -07:00
Jee Li 67005a07bc [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-21 04:46:28 +00:00
Jinzhen Lin 1f5674218f [Kernel] Add punica dimension for Qwen2 LoRA (#5441) 2024-06-20 17:55:41 -07:00
sergey-tinkoff 07feecde1a [Model] LoRA support added for command-r (#5178) 2024-06-18 11:01:21 -07:00
Joe Runde 5002175e80 [Kernel] Add punica dimensions for Granite 13b (#5559)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-06-18 03:54:11 +00:00
Cyrus Leung 0e9164b40a [mypy] Enable type checking for test directory (#5017) 2024-06-15 04:45:31 +00:00
youkaichao ea3890a5f0 [Core][Distributed] code deduplication in tp&pp with coordinator(#5293)
[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293)
2024-06-12 17:27:08 -07:00
Cyrus Leung 0bfa1c4f13 [Misc] Improve error message when LoRA parsing fails (#5194) 2024-06-10 19:38:49 +08:00
Antoni Baum ccdc490dda [Core] Change LoRA embedding sharding to support loading methods (#5038) 2024-06-06 19:07:57 -07:00
Cyrus Leung 5ae5ed1e60 [Core] Consolidate prompt arguments to LLM engines (#4328)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-05-28 13:29:31 -07:00
raywanb 97b030005c [Model] LoRA gptbigcode implementation (#3949) 2024-05-22 13:58:59 -07:00
SangBin Cho c74c913bfb [misc] remove comments that were supposed to be removed (#4977) 2024-05-22 09:02:58 -04:00
Isotr0py f12c3b5b3d [Model] Add Phi-2 LoRA support (#4886) 2024-05-21 14:24:17 +09:00
SangBin Cho 2e9a2227ec [Lora] Support long context lora (#4787)
Currently we need to call rotary embedding kernel for each LoRA, which makes it hard to serve multiple long context length LoRA. Add batched rotary embedding kernel and pipe it through.

It replaces the rotary embedding layer to the one that is aware of multiple cos-sin-cache per scaling factors.

Follow up of https://github.com/vllm-project/vllm/pull/3095/files
2024-05-18 16:05:23 +09:00
Silencio 8435b207af [Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850)
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
2024-05-16 11:16:09 -07:00
Nick Hill 676a99982f [Core] Add MultiprocessingGPUExecutor (#4539)
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
2024-05-14 10:38:59 -07:00
Austin Veselka eefeb16464 [Kernel] Full Tensor Parallelism for LoRA Layers (#3524)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-04-27 00:03:48 -07:00
Woosuk Kwon 468d761b32 [Misc] Reduce supported Punica dtypes (#4304) 2024-04-23 18:54:33 -07:00
Jee Li d17c8477f1 [Bugfix] Fix LoRA loading check (#4138)
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-04-19 00:59:54 -07:00