Commit Graph

114 Commits

Author SHA1 Message Date
Sage Moore ce4f5a29fb Add Automatic Prefix Caching (#2762)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
Robert Shaw c0c2335ce0 Integrate Marlin Kernels for Int4 GPTQ inference (#2497)
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
felixzhu555 703e42ee4b Add guided decoding for OpenAI API server (#2819)
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-02-29 22:13:08 +00:00
Seonghyeon bfdcfa6a05 Support starcoder2 architecture (#3089) 2024-02-29 00:51:48 -08:00
Woosuk Kwon 929b4f2973 Add LoRA support for Gemma (#3050) 2024-02-28 13:03:28 -08:00
Liangfu Chen 3b7178cfa4 [Neuron] Support inference with transformers-neuronx (#2569) 2024-02-28 09:34:34 -08:00
Tao He 71bcaf99e2 Enable GQA support in the prefix prefill kernels (#3007)
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-02-27 01:14:31 -08:00
Dylan Hawk e0ade06d63 Support logit bias for OpenAI API (#3027) 2024-02-27 11:51:53 +08:00
Jared Moore 70f3e8e3a1 Add LogProbs for Chat Completions in OpenAI (#2918) 2024-02-26 10:39:34 +08:00
Harry Mellor ef978fe411 Port metrics from aioprometheus to prometheus_client (#2730) 2024-02-25 11:54:00 -08:00
Ronen Schaffer 4caf7044e0 Include tokens from prompt phase in counter_generation_tokens (#2802) 2024-02-22 14:00:12 -08:00
Woosuk Kwon fd5dcc5c81 Optimize GeGLU layer in Gemma (#2975) 2024-02-21 20:17:52 -08:00
Massimiliano Pronesti 93dc5a2870 chore(vllm): codespell for spell checking (#2820) 2024-02-21 18:56:01 -08:00
Nick Hill 7d2dcce175 Support per-request seed (#2514) 2024-02-21 11:47:00 -08:00
Antoni Baum 017d9f1515 Add metrics to RequestOutput (#2876) 2024-02-20 21:55:57 -08:00
Zhuohan Li 63e2a6419d [FIX] Fix beam search test (#2930) 2024-02-20 14:37:39 -08:00
Ronen Schaffer e433c115bc Fix vllm:prompt_tokens_total metric calculation (#2869) 2024-02-18 23:55:41 -08:00
Isotr0py ab3a5a8259 Support OLMo models. (#2832) 2024-02-18 21:05:15 -08:00
Zhuohan Li a61f0521b8 [Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00
jvmncs 8f36444c4f multi-LoRA as extra models in OpenAI server (#2775)
how to serve the loras (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.api_server \
 --model meta-llama/Llama-2-7b-hf \
 --enable-lora \
 --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
the above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified lora modules. in this case sql-lora and sql-lora2 point to the same underlying lora, but this need not be the case. lora config values take the same values they do in EngineArgs

no work has been done here to scope client permissions to specific models
2024-02-17 12:00:48 -08:00
Woosuk Kwon d7afab6d3a [BugFix] Fix GC bug for LLM class (#2882) 2024-02-14 22:17:44 -08:00
Terry 2a543d6efe Add LoRA support for Mixtral (#2831)
* add mixtral lora support

* formatting

* fix incorrectly ported logic

* polish tests

* minor fixes and refactoring

* minor fixes

* formatting

* rename and remove redundant logic

* refactoring

* refactoring

* minor fix

* minor refactoring

* fix code smell
2024-02-14 00:55:45 +01:00
Lily Liu fe6d09ae61 [Minor] More fix of test_cache.py CI test failure (#2750) 2024-02-06 11:38:38 -08:00
Woosuk Kwon f0d4e14557 Add fused top-K softmax kernel for MoE (#2769) 2024-02-05 17:38:02 -08:00
Hongxia Yang 56f738ae9b [ROCm] Fix some kernels failed unit tests (#2498) 2024-02-05 14:25:36 -08:00
Kunshang Ji 96b6f475dd Remove hardcoded device="cuda" to support more devices (#2503)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
Philipp Moritz d0d93b92b1 Add unit test for Mixtral MoE layer (#2677) 2024-01-31 14:34:17 -08:00
Philipp Moritz 89efcf1ce5 [Minor] Fix test_cache.py CI test failure (#2684) 2024-01-31 10:12:11 -08:00
Vladimir 4f65af0e25 Add swap_blocks unit tests (#2616) 2024-01-30 09:30:50 -08:00
wangding zeng 5d60def02c DeepseekMoE support with Fused MoE kernel (#2453)
Co-authored-by: roy <jasonailu87@gmail.com>
2024-01-29 21:19:48 -08:00
zhaoyang-star 9090bf02e7 Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Hanzhi Zhou 380170038e Implement custom all reduce kernels (#2192) 2024-01-27 12:46:35 -08:00
Simon Mo 3a7dd7e367 Support Batch Completion in Server (#2529) 2024-01-24 17:11:07 -08:00
Nikola Borisov 3209b49033 [Bugfix] fix crash if max_tokens=None (#2570) 2024-01-23 22:38:55 -08:00
Antoni Baum 9b945daaf1 [Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Jason Zhu 7a0b011dd5 Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py (#2553) 2024-01-22 14:47:25 -08:00
Cade Daniel 18bfcdd05c [Speculative decoding 2/9] Multi-step worker for draft model (#2424) 2024-01-21 16:31:47 -08:00
Zhuohan Li ef9b636e2d Simplify broadcast logic for control messages (#2501) 2024-01-19 11:23:30 -08:00
Simon Mo dd7e8f5f64 refactor complemention api for readability (#2499) 2024-01-18 16:45:14 -08:00
shiyi.c_98 d10f8e1d43 [Experimental] Prefix Caching Support (#1669)
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-17 16:32:10 -08:00
FlorianJoncour 14cc317ba4 OpenAI Server refactoring (#2360) 2024-01-16 21:33:14 -08:00
Hyunsung Lee e1957c6ebd Add StableLM3B model (#2372) 2024-01-16 20:32:40 -08:00
Simon Mo 6e01e8c1c8 [CI] Add Buildkite (#2355) 2024-01-14 12:37:58 -08:00
陈序 218dc2ccda Aligning top_p and top_k Sampling (#1885)
* Align top_p and top_k with huggingface

* remove _get_prompt_and_output_tokens

* rename _apply_top_p_top_k

* compare top_p top_k with hf

* fix test errors
2024-01-12 22:51:03 +01:00
Cade Daniel 79d64c4954 [Speculative decoding 1/9] Optimized rejection sampler (#2336) 2024-01-09 15:38:41 -08:00
Woosuk Kwon 941767127c Revert the changes in test_cache (#2335) 2024-01-03 17:32:05 -08:00
Zhuohan Li fd4ea8ef5c Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) 2024-01-03 11:30:22 -08:00
Jee Li 77af974b40 [FIX] Support non-zero CUDA devices in custom kernels (#1959) 2024-01-02 19:09:59 -08:00
Zhuohan Li 358c328d69 [BUGFIX] Fix communication test (#2285) 2023-12-27 17:18:11 -05:00
Zhuohan Li 4aaafdd289 [BUGFIX] Fix the path of test prompts (#2273) 2023-12-26 10:37:21 -08:00