54 Commits

Author SHA1 Message Date
jon-chuang 50b8d08dbd [Misc/Testing] Use torch.testing.assert_close (#7324) 2024-08-16 04:24:04 +00:00
Jee Jee Li 97992802f3 [CI/Build]Reduce the time consumption for LoRA tests (#7396) 2024-08-13 17:27:29 -07:00
Jee Jee Li 9118217f58 [LoRA] Relax LoRA condition (#7146) 2024-08-06 01:57:25 +00:00
Jee Jee Li 99d7cabd7b [LoRA] ReplicatedLinear support LoRA (#7081) 2024-08-02 22:40:19 -07:00
Jee Jee Li 7ecee34321 [Kernel][RFC] Refactor the punica kernel based on Triton (#5036) 2024-07-31 17:12:24 -07:00
Jiaxin Shan 42c7f66a38 [Core] Support dynamically loading Lora adapter from HuggingFace (#6234)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-22 15:42:40 -07:00
Swapnil Parekh 4d6ada947c [CORE] Adding support for insertion of soft-tuned prompts (#4645)
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-09 13:26:36 -07:00
Qubitium-ModelCloud ee93f4f92a [CORE] Quantized lm-head Framework (#4442)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
2024-07-02 22:25:17 +00:00
SangBin Cho f5e73c9f1b [Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (#5909)
Co-authored-by: sang <sangcho@anyscale.com>
2024-06-30 17:11:15 +00:00
Joe Runde ba4994443a [Kernel] Add punica dimensions for Granite 3b and 8b (#5930)
Signed-off-by: Joe Runde <joe@joerun.de>
2024-06-29 10:48:25 +08:00
rohithkrn f5dda63eb5 [LoRA] Add support for pinning lora adapters in the LRU cache (#5603) 2024-06-21 15:42:46 -07:00
Jee Li 67005a07bc [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-21 04:46:28 +00:00
Jinzhen Lin 1f5674218f [Kernel] Add punica dimension for Qwen2 LoRA (#5441) 2024-06-20 17:55:41 -07:00
sergey-tinkoff 07feecde1a [Model] LoRA support added for command-r (#5178) 2024-06-18 11:01:21 -07:00
Joe Runde 5002175e80 [Kernel] Add punica dimensions for Granite 13b (#5559)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-06-18 03:54:11 +00:00
Cyrus Leung 0e9164b40a [mypy] Enable type checking for test directory (#5017) 2024-06-15 04:45:31 +00:00
youkaichao ea3890a5f0 [Core][Distributed] code deduplication in tp&pp with coordinator(#5293)
[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293)
2024-06-12 17:27:08 -07:00
Cyrus Leung 0bfa1c4f13 [Misc] Improve error message when LoRA parsing fails (#5194) 2024-06-10 19:38:49 +08:00
Antoni Baum ccdc490dda [Core] Change LoRA embedding sharding to support loading methods (#5038) 2024-06-06 19:07:57 -07:00
Cyrus Leung 5ae5ed1e60 [Core] Consolidate prompt arguments to LLM engines (#4328)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-05-28 13:29:31 -07:00
raywanb 97b030005c [Model] LoRA gptbigcode implementation (#3949) 2024-05-22 13:58:59 -07:00
SangBin Cho c74c913bfb [misc] remove comments that were supposed to be removed (#4977) 2024-05-22 09:02:58 -04:00
Isotr0py f12c3b5b3d [Model] Add Phi-2 LoRA support (#4886) 2024-05-21 14:24:17 +09:00
SangBin Cho 2e9a2227ec [Lora] Support long context lora (#4787)
Currently we need to call the rotary embedding kernel once for each LoRA, which makes it hard to serve multiple long-context LoRAs. Add a batched rotary embedding kernel and pipe it through.

It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
2024-05-18 16:05:23 +09:00
Silencio 8435b207af [Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850)
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
2024-05-16 11:16:09 -07:00
Nick Hill 676a99982f [Core] Add MultiprocessingGPUExecutor (#4539)
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
2024-05-14 10:38:59 -07:00
Austin Veselka eefeb16464 [Kernel] Full Tensor Parallelism for LoRA Layers (#3524)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-04-27 00:03:48 -07:00
Woosuk Kwon 468d761b32 [Misc] Reduce supported Punica dtypes (#4304) 2024-04-23 18:54:33 -07:00
Jee Li d17c8477f1 [Bugfix] Fix LoRA loading check (#4138)
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-04-19 00:59:54 -07:00
Shoichi Uchinami a53222544c [Kernel] Add punica dimension for Swallow-MS-7B LoRA (#4134) 2024-04-17 10:02:45 -07:00
Antoni Baum 69e1d2fb69 [Core] Refactor model loading code (#4097) 2024-04-16 11:34:39 -07:00
Jee Li 989ae2538d [Kernel] Add punica dimension for Baichuan-13B (#4053) 2024-04-13 07:55:05 -07:00
Jee Li 1096717ae9 [Core] Support LoRA on quantized models (#4012) 2024-04-11 21:02:44 -07:00
Antoni Baum 1e96c3341a Add extra punica sizes to support bigger vocabs (#4015) 2024-04-11 22:18:57 +00:00
youkaichao 63e7176f26 [Core][Refactor] move parallel_utils into vllm/distributed (#3950)
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)
2024-04-10 15:33:30 -07:00
Jee Li 11dd6ebb89 [Misc] Avoid loading incorrect LoRA config (#3777) 2024-04-09 19:47:15 -07:00
Cade Daniel e7c7067b45 [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837) 2024-04-09 11:44:15 -07:00
Jee Li 566b57c5c4 [Kernel] support non-zero cuda devices in punica kernels (#3636) 2024-03-27 00:37:42 +00:00
Jee Li 8af890a865 Enable more models to inference based on LoRA (#3382)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-03-25 18:09:31 -07:00
SangBin Cho 01bfb22b41 [CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
Zhuohan Li e90fc21f2e [Hardware][Neuron] Refactor neuron support (#3471) 2024-03-22 01:22:17 +00:00
Roy f1c0fc3919 Migrate logits computation and gather to model_runner (#3233) 2024-03-20 23:25:01 +00:00
SangBin Cho 6e435de766 [1/n][Chunked Prefill] Refactor input query shapes (#3236) 2024-03-20 14:46:05 -07:00
Antoni Baum fb96c1e98c Asynchronous tokenization (#2879) 2024-03-15 23:37:01 +00:00
Or Sharir ae0ccb4017 Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350) 2024-03-13 12:18:25 -07:00
Zhuohan Li 4c922709b6 Add distributed model executor abstraction (#3191) 2024-03-11 11:03:45 -07:00
Zhuohan Li 2f8844ba08 Re-enable the 80 char line width limit (#3305) 2024-03-10 19:49:14 -07:00
Terry 0bba88df03 Enhance lora tests with more layer and rank variations (#3243) 2024-03-09 17:14:16 -08:00
Woosuk Kwon 929b4f2973 Add LoRA support for Gemma (#3050) 2024-02-28 13:03:28 -08:00
Liangfu Chen 3b7178cfa4 [Neuron] Support inference with transformers-neuronx (#2569) 2024-02-28 09:34:34 -08:00