wassname/vllm

mirror of https://github.com/wassname/vllm.git synced 2026-07-05 07:23:34 +08:00

Author	SHA1	Message	Date
Murali Andoorveedu	5eda2ea02a	[Core][1/N] Support send/recv in PyNCCL Groups (#4988) Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>	2024-05-23 09:54:48 -07:00
Letian Li	2ba80bed27	[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is not defined (#5009)	2024-05-23 09:08:58 -07:00
Alexander Matveev	6066253296	Marlin 24 prefill performance improvement (about 25% better on average) (#4983)	2024-05-23 02:39:27 -04:00
Cody Yu	ee3eea0a1b	[Misc] Take user preference in attention selector (#4960)	2024-05-23 07:55:56 +09:00
Philipp Moritz	a36de682d4	[Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig (#4991)	2024-05-22 22:26:56 +00:00
Nick Hill	eb6d3c264d	[Core] Eliminate parallel worker per-step task scheduling overhead (#4894)	2024-05-23 06:17:27 +09:00
raywanb	97b030005c	[Model] LoRA gptbigcode implementation (#3949)	2024-05-22 13:58:59 -07:00
Cody Yu	a3a73ab069	[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) The 2nd PR for #4532. This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).	2024-05-22 13:28:20 -07:00
Tyler Michael Smith	8674f9880e	[Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954) Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs	2024-05-22 14:10:43 +00:00
SangBin Cho	c74c913bfb	[misc] remove comments that were supposed to be removed (#4977)	2024-05-22 09:02:58 -04:00
Michael Goin	5f6d10c14c	[CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#4722)	2024-05-22 07:18:41 +00:00
sasha0552	9b9a10d6cb	[Frontend] Dynamic RoPE scaling (#4638)	2024-05-22 01:32:35 -04:00
Isotr0py	99eff67ba9	[Bugfix][Kernel] Add head size check for attention backend selection (#4944)	2024-05-21 15:33:25 -04:00
Kante Yin	14772eeb8e	[Bugfix] Fix flag name for `max_seq_len_to_capture` (#4935) Signed-off-by: kerthcet <kerthcet@gmail.com>	2024-05-21 09:30:52 -07:00
Michael Goin	757b62c495	[CI/Build] Codespell ignore `build/` directory (#4945)	2024-05-21 09:06:10 -07:00
Simon Mo	e941f88584	[Docs] Add acknowledgment for sponsors (#4925)	2024-05-21 00:17:25 -07:00
Isotr0py	f12c3b5b3d	[Model] Add Phi-2 LoRA support (#4886)	2024-05-21 14:24:17 +09:00
HUANG Fei	d130b573a0	[Model] add rope_scaling support for qwen2 (#4930)	2024-05-21 05:22:22 +00:00
Antoni Baum	65ae8c2c8f	[Core] Fix scheduler considering "no LoRA" as "LoRA" (#4897)	2024-05-20 17:48:32 -07:00
Kuntai Du	c3af44722c	[Doc]Add documentation to benchmarking script when running TGI (#4920)	2024-05-20 20:16:57 +00:00
Aurick Qiao	1937e29848	[Core] Sharded State Loader download from HF (#4889)	2024-05-20 11:46:12 -07:00
Mor Zusman	f0eecee610	[Bugfix] Fix dummy weight for fp8 (#4916) Allow dummy load format for fp8, torch.uniform_ doesn't support FP8 at the moment Co-authored-by: Mor Zusman <morz@ai21.com>	2024-05-20 18:44:25 +00:00
Alexei-V-Ivanov-AMD	943e72ca56	[Build/CI] Enabling AMD Entrypoints Test (#4834) Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com>	2024-05-20 11:29:28 -07:00
Wenwei Zhang	546a97ef69	[Misc]: allow user to specify port in distributed setting (#4914)	2024-05-20 17:45:06 +00:00
Alexander Matveev	da5a0b539d	Remove marlin warning (#4918)	2024-05-20 14:55:34 +00:00
Cyrus Leung	6287537a0c	[Model] LLaVA model refactor (#4910)	2024-05-20 08:11:25 +00:00
Woosuk Kwon	b57e6c5949	[Kernel] Add flash-attn back (#4907)	2024-05-19 18:11:30 -07:00
Alexander Matveev	27ce85476e	[Kernel] Add marlin_24 unit tests (#4901)	2024-05-19 11:37:34 -04:00
Cyrus Leung	f68470e803	[Bugfix][Model] Add base class for vision-language models (#4809)	2024-05-19 00:13:33 -07:00
SangBin Cho	2e9a2227ec	[Lora] Support long context lora (#4787) Currently we need to call rotary embedding kernel for each LoRA, which makes it hard to serve multiple long context length LoRA. Add batched rotary embedding kernel and pipe it through. It replaces the rotary embedding layer to the one that is aware of multiple cos-sin-cache per scaling factors. Follow up of https://github.com/vllm-project/vllm/pull/3095/files	2024-05-18 16:05:23 +09:00
alexeykondrat	c0724fc915	[ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if Triton is not used (#4658)	2024-05-18 05:09:11 +00:00
Michael Goin	86b45ae065	[Bugfix] Relax tiktoken to >= 0.6.0 (#4890)	2024-05-17 12:58:52 -06:00
Antoni Baum	c5711ef985	[Doc] Update Ray Data distributed offline inference example (#4871)	2024-05-17 10:52:11 -07:00
eigenLiu	48d5985a08	Sync huggingface modifications of qwen Moe model (#4774)	2024-05-17 09:43:19 -07:00
Jinzhen Lin	33e0823de5	[Bugfix] fix rope error when load models with different dtypes (#4835)	2024-05-17 18:43:34 +09:00
Alexei-V-Ivanov-AMD	26148120b3	[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797)	2024-05-16 20:58:25 -07:00
bofeng huang	0150a10630	[Frontend] OpenAI API server: Do not add bos token by default when encoding (#4688)	2024-05-16 18:47:22 -07:00
Kante Yin	8e7fb5d43a	Support to serve vLLM on Kubernetes with LWS (#4829) Signed-off-by: kerthcet <kerthcet@gmail.com>	2024-05-16 16:37:29 -07:00
Woosuk Kwon	9a31a817a8	[Bugfix] Fix FP8 KV cache support (#4869)	2024-05-16 22:42:29 +00:00
Tyler Michael Smith	2060e93659	[Kernel] Add w8a8 CUTLASS kernels (#4749)	2024-05-16 18:32:50 -04:00
Silencio	8435b207af	[Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850) Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>	2024-05-16 11:16:09 -07:00
youkaichao	10fa9eea21	[Misc] remove old comments (#4866)	2024-05-16 11:07:41 -07:00
youkaichao	e08188081b	[Core][Distributed] remove graph mode function (#4818)	2024-05-16 10:59:52 -07:00
Hongxia Yang	b5853f9963	[ROCm][AMD][Bugfix] adding a missing triton autotune config (#4845)	2024-05-16 10:46:52 -07:00
Simon Mo	f09edd8a25	Add JSON output support for benchmark_latency and benchmark_throughput (#4848)	2024-05-16 10:02:56 -07:00
Alexander Matveev	6979ade384	Add GPTQ Marlin 2:4 sparse structured support (#4790) Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2024-05-16 12:56:15 -04:00
Pierre Dulac	9216b9cc38	[Bugfix] Bypass authorization API token for preflight requests (#4862)	2024-05-16 09:42:21 -07:00
Alex Wu	5e0391c040	[Frontend] Separate OpenAI Batch Runner usage from API Server (#4851)	2024-05-17 00:42:41 +09:00
Alex Wu	dbc0754ddf	[docs] Fix typo in examples filename openi -> openai (#4864)	2024-05-17 00:42:17 +09:00
Jinzhen Lin	99caa49106	[Kernel] add bfloat16 support for gptq marlin kernel (#4788)	2024-05-16 09:55:29 -04:00

1398 Commits