Commit Graph

  • b87c21fc89 [Misc][Platform] Move use allgather to platform (#14010) Mengqing Cao 2025-03-03 15:40:04 +08:00
  • e584b85afd [Misc] duplicate code in deepseek_v2 (#14106) wang.yuqi 2025-03-03 14:10:11 +08:00
  • 09e56f9262 [Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure (#14051) Sheng Yao 2025-03-03 09:35:01 +08:00
  • cf069aa8aa Update deprecated Python 3.8 typing (#13971) Harry Mellor 2025-03-03 01:34:51 +00:00
  • bf33700ecd [v0][structured output] Support reasoning output (#12955) Ce Gao 2025-03-03 03:49:42 +08:00
  • bc6ccb9878 [Doc] Source building add clone step (#14086) qux-bbb 2025-03-02 18:59:50 +08:00
  • 82fbeae92b [Misc] Accurately capture the time of loading weights (#14063) Jun Duan 2025-03-01 20:20:30 -05:00
  • cc5e8f6db8 [Model] Add LoRA support for TransformersModel (#13770) Jee Jee Li 2025-03-02 09:17:34 +08:00
  • d54990da47 [v1] Add __repr__ to KVCacheBlock to avoid recursive print (#14081) Chen Zhang 2025-03-02 04:46:02 +08:00
  • b9f1d4294e [v1][Bugfix] Only cache blocks that are not in the prefix cache (#14073) Chen Zhang 2025-03-01 16:25:54 +08:00
  • b28246f6ff [ROCm][V1][Bugfix] Add get_builder_cls method to the ROCmAttentionBackend class (#14065) Sage Moore 2025-02-28 23:18:32 -08:00
  • 3b5567a209 [V1][Minor] Do not print attn backend twice (#13985) Woosuk Kwon 2025-02-28 23:09:14 -08:00
  • fdcc405346 [Doc] Consolidate whisper and florence2 examples (#14050) Isotr0py 2025-03-01 14:49:15 +08:00
  • 8994dabc22 [Documentation] Add more deployment guide for Kubernetes deployment (#13841) Kuntai Du 2025-02-28 22:44:24 -08:00
  • 02296f420d [Bugfix][V1][Minor] Fix shutting_down flag checking in V1 MultiprocExecutor (#14053) Li, Jiang 2025-03-01 14:31:01 +08:00
  • 6a92ff93e1 [Misc][Kernel]: Add GPTQAllSpark Quantization (#12931) YajieWang 2025-03-01 14:30:59 +08:00
  • 6a84164add [Bugfix] Add file lock for ModelScope download (#14060) Jee Jee Li 2025-03-01 14:10:28 +08:00
  • f64ffa8c25 [Docs] Add pipeline_parallel_size to optimization docs (#14059) Brayden Zhong 2025-03-01 00:43:54 -05:00
  • bd56c983d6 [torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 case, rename RedundantReshapesPass to NoopEliminationPass (#10902) Luka Govedič 2025-02-28 18:20:11 -05:00
  • 084bbac8cc [core] Bump ray to 2.43 (#13994) Rui Qiao 2025-02-28 13:47:44 -08:00
  • 28943d36ce [v1] Move block pool operations to a separate class (#13973) Chen Zhang 2025-03-01 04:53:31 +08:00
  • b526ca6726 Add RELEASE.md (#13926) Andrey Talman 2025-02-28 20:25:50 +00:00
  • e7bd944e08 [v1] Cleanup the BlockTable in InputBatch (#13977) Chen Zhang 2025-03-01 03:03:16 +08:00
  • c3b6559a10 [V1][TPU] Integrate the new ragged paged attention kernel with vLLM v1 on TPU (#13379) iefgnoix 2025-02-28 10:01:36 -08:00
  • 4be4b26cb7 Fix entrypoint tests for embedding models (#14052) Harry Mellor 2025-02-28 16:56:44 +00:00
  • 2aed2c9fa7 [Doc] Fix ROCm documentation (#14041) Brayden Zhong 2025-02-28 11:42:07 -05:00
  • 9b61dd41e7 [Bugfix] Initialize attention bias on the same device as Query/Key/Value for QwenVL Series (#14031) Yang Liu 2025-02-28 23:36:08 +08:00
  • f7bee5c815 [VLM][Bugfix] Enable specifying prompt target via index (#14038) Cyrus Leung 2025-02-28 23:35:55 +08:00
  • e0734387fb [Bugfix] Fix MoeWNA16Method activation (#14024) Jee Jee Li 2025-02-28 23:22:42 +08:00
  • f58f8b5c96 Update AutoAWQ docs (#14042) Harry Mellor 2025-02-28 15:20:29 +00:00
  • b3f7aaccd0 [V1][Minor] Restore V1 compatibility with LLMEngine class (#13090) Thibault Schueller 2025-02-28 09:52:25 +01:00
  • b91660ddb8 [Hardware][Intel-Gaudi] Regional compilation support (#13213) Kacper Pietkun 2025-02-28 09:51:49 +01:00
  • 76c89fcadd Use smaller embedding model when not testing model specifically (#13891) Harry Mellor 2025-02-28 08:50:43 +00:00
  • b9e41734c5 [Bugfix][Disaggregated] patch the inflight batching on the decode node in SimpleConnector to avoid hangs in SimpleBuffer (nccl based) (#13987) Mathis Felardos 2025-02-28 08:53:45 +01:00
  • 1088f06242 [Doc] Move multimodal Embedding API example to Online Serving page (#14017) Cyrus Leung 2025-02-28 15:12:04 +08:00
  • 73e0225ee9 [Bugfix] Check that number of images matches number of <|image|> tokens with mllama (#13911) Travis Johnson 2025-02-27 21:00:45 -07:00
  • 6c85da3a18 [V1]SupportsV0Only protocol for model definitions (#13959) Roger Wang 2025-02-27 17:02:15 -08:00
  • 67fc426845 [Misc] Print FusedMoE detail info (#13974) Jee Jee Li 2025-02-28 07:53:13 +08:00
  • 9804145cac [Model][Speculative Decoding] Expand DeepSeek MTP code to support k > n_predict (#13626) Benjamin Chislett 2025-02-27 18:28:08 -05:00
  • 2e94b9cfbb [Attention] Flash MLA for V1 (#13867) Lucas Wilkinson 2025-02-27 18:03:41 -05:00
  • 8294773e48 [core] Perf improvement for DSv3 on AMD GPUs (#13718) qli88 2025-02-27 16:14:30 -06:00
  • cd813c6d4d [V1][Minor] Minor cleanup for GPU Model Runner (#13983) Woosuk Kwon 2025-02-27 13:11:40 -08:00
  • 38acae6e97 [ROCm] Fix the Kernels, Core, and Prefix Caching AMD CI groups (#13970) Sage Moore 2025-02-27 12:31:47 -08:00
  • a2dd48c386 [VLM] Deprecate legacy input mapper for OOT multimodal models (#13979) Cyrus Leung 2025-02-28 03:14:55 +08:00
  • 126f6beeb4 Bump azure/setup-helm from 4.2.0 to 4.3.0 (#13742) dependabot[bot] 2025-02-27 19:04:10 +00:00
  • 58d1b2aa77 [Attention] MLA support for V1 (#13789) Yang Chen 2025-02-27 10:14:17 -08:00
  • f1579b229d [VLM] Generalized prompt updates for multi-modal processor (#13964) Cyrus Leung 2025-02-28 01:44:25 +08:00
  • 7864875879 [Bugfix] Fix qwen2.5-vl overflow issue (#13968) Isotr0py 2025-02-28 01:30:39 +08:00
  • 1dd422b64a Update LMFE version to v0.10.11 to support new versions of transforme… (#13930) Noam Gat 2025-02-27 19:16:12 +02:00
  • 06c8f8d885 [bugfix] Fix profiling for RayDistributedExecutor (#13945) Rui Qiao 2025-02-27 09:01:21 -08:00
  • 5677c9bb3e Deduplicate .pre-commit-config.yaml's exclude (#13967) Harry Mellor 2025-02-27 16:27:47 +00:00
  • 512d77d582 Update quickstart.md (#13958) 王博伟 2025-02-28 00:05:11 +08:00
  • 7f0be2aa24 [Model] Deepseek GGUF support (#13167) Szymon Ożóg 2025-02-27 11:08:35 +01:00
  • edf309ebbe [VLM] Support multimodal inputs for Florence-2 models (#13320) Isotr0py 2025-02-27 18:06:41 +08:00
  • 788f284b53 Fix test_block_fp8.py test for MoE (#13915) Michael Goin 2025-02-27 05:00:00 -05:00
  • 4b1d141f49 [PP] Correct cache size check (#13873) Yang Zheng 2025-02-27 17:47:29 +08:00
  • 10c3b8c1cf [Misc] fixed 'required' is an invalid argument for positionals (#13948) Chauncey 2025-02-27 17:06:49 +08:00
  • a7f37314b7 [CI/Build] Add examples/ directory to be labelled by mergify (#13944) Brayden Zhong 2025-02-27 03:24:11 -05:00
  • cd711c48b2 [V1][Metrics] Handle preemptions (#13169) Mark McLoughlin 2025-02-27 04:04:59 +00:00
  • 378b3ef6f8 [ROCm][V1] Update reshape_and_cache to properly work with CUDA graph padding (#13922) Sage Moore 2025-02-26 20:04:12 -08:00
  • c9944acbf9 [misc] Rename Ray ADAG to Compiled Graph (#13928) Rui Qiao 2025-02-26 20:03:28 -08:00
  • ca377cf1b9 Use CUDA 12.4 as default for release and nightly wheels (#12098) Michael Goin 2025-02-26 22:06:37 -05:00
  • a31614e386 [ROCm][Quantization][Kernel] Use FP8 FNUZ when OCP flag is 0 or undefined (#13851) ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 2025-02-27 04:39:10 +02:00
  • f95903909f [Kernel] FlashMLA integration (#13747) Lucas Wilkinson 2025-02-26 21:35:08 -05:00
  • b382a7f28f [BugFix] Make FP8 Linear compatible with torch.compile (#13918) Woosuk Kwon 2025-02-26 13:48:55 -08:00
  • 4cb6fa0a9c [Bugfix] Backend option to disable xgrammar any_whitespace (#12744) Wallas Henrique 2025-02-26 15:52:34 -03:00
  • d08b285adf [Misc] fixed qwen_vl_utils parameter error (#13906) Chauncey 2025-02-27 00:31:53 +08:00
  • b27122acc2 [TPU] use torch2.6 with whl package (#13860) Chenyaaang 2025-02-26 05:18:54 -08:00
  • 934bb99c71 [Bugfix] Update expected token counts for Ultravox tests (#13895) Cyrus Leung 2025-02-26 20:56:50 +08:00
  • 3f808cc044 [Bugfix] Do not crash V0 engine on input errors (#13101) Joe Runde 2025-02-26 04:07:29 -07:00
  • ec8a5e5386 [Misc]: Add support for goodput on guided benchmarking + TPOT calculation refactor (#13736) Brayden Zhong 2025-02-26 06:06:47 -05:00
  • 215bf150a6 [Bugfix] Handle None parameters in Mistral function calls. (#13786) Florian Greinacher 2025-02-26 12:06:21 +01:00
  • 0ecdd98031 Add comments on accessing kv_cache and attn_metadata (#13887) Harry Mellor 2025-02-26 10:41:02 +00:00
  • 7b700ec8c8 [Bugfix] Add test example for Ultravox v0.5 (#13890) Cyrus Leung 2025-02-26 18:31:43 +08:00
  • 7ca1da020f [Misc] Fix input processing for Ultravox (#13871) Roger Wang 2025-02-25 23:56:34 -08:00
  • 5157338ed9 [Misc] Improve LoRA spelling (#13831) Jee Jee Li 2025-02-26 15:43:01 +08:00
  • e206b54331 [v0][Core] Use xgrammar shared context to avoid copy overhead for offline engine (#13837) Seth Kimmel 2025-02-25 22:58:24 -08:00
  • 1d35662e6d [ROCm] Disable chunked prefill/prefix caching when running MLA on non-cuda platforms (#13844) Sage Moore 2025-02-25 22:56:58 -08:00
  • e656f638de [Doc] fix the incorrect module path of tensorize_vllm_model (#13863) Albert 2025-02-26 14:56:19 +08:00
  • 145944cb94 Improve pipeline partitioning (#13839) Harry Mellor 2025-02-26 02:53:56 +00:00
  • 094b7d9496 [Kernel][Build/CI] Bump CUTLASS to 3.8 and add initializers for cutlass epilogues (#13797) Henry Tsang 2025-02-25 18:52:03 -08:00
  • e1fe7591f2 [Misc]Code Cleanup (#13859) Chenguang Li 2025-02-26 10:44:30 +08:00
  • 5629f26df7 [V1][Spec Decode] Change Spec Decode Rejection Sampling API (#13729) Lily Liu 2025-02-25 18:14:48 -08:00
  • 9ba28043b5 [misc] Show driver IP info when Ray fails to allocate driver worker (#13858) Rui Qiao 2025-02-25 17:53:43 -08:00
  • 24679788ed DeepSeek V2/V3/R1 only place lm_head on last pp rank (#13833) Harry Mellor 2025-02-26 01:24:57 +00:00
  • 07c4353057 [Model] Support Grok1 (#13795) Michael Goin 2025-02-25 20:07:12 -05:00
  • 34e3494e70 Fix failing MyGemma2Embedding test (#13820) Harry Mellor 2025-02-25 20:33:03 +00:00
  • f75aa72732 [Neuron] Add custom_ops for neuron backend (#13246) Liangfu Chen 2025-02-25 11:47:49 -08:00
  • 340e39e387 Fix string parsing error (#13825) Chen1022 2025-02-26 00:20:29 +08:00
  • f4133ce4e5 [Bugfix] Revert inspection code in #13743 (#13832) Cyrus Leung 2025-02-26 00:18:50 +08:00
  • 6522d55b6f Fix /v1/audio/transcriptions Bad Request Error (#13811) Wen Sun 2025-02-25 22:03:33 +08:00
  • 6ff518626c [Bugfix] Fix deepseek-vl2 inference with more than 2 images (#13818) Isotr0py 2025-02-25 22:03:02 +08:00
  • fa82074167 [Bugfix] Flush TunableOp results before worker processes are destroyed. (#13623) Nichols A. Romero 2025-02-25 05:08:20 -06:00
  • 75e9d49796 [Bugfix] Initialize attention bias on the same device as Query/Key/Value (#13468) Junlin Zhou 2025-02-25 18:13:09 +08:00
  • 32c3b6bfd1 [Misc]Clarify Error Handling for Non-existent Model Paths and HF Repo IDs (#13724) Chen1022 2025-02-25 18:12:19 +08:00
  • 37b6cb4985 [CI/Build] Fix V1 LoRA failure (#13767) Jee Jee Li 2025-02-25 18:01:15 +08:00
  • aabeb2688f [ROCm][Quantization][Kernel] Using HIP FP8 header (#12593) Gregory Shtrasberg 2025-02-25 03:39:59 -05:00
  • 2f42a4888c [Feature] Support KV cache offloading and disagg prefill with LMCache connector. (#12953) Jiayi Yao 2025-02-25 02:38:42 -06:00
  • 3173c3b34e [misc] Clean up ray compiled graph type hints (#13731) Rui Qiao 2025-02-25 00:37:08 -08:00
  • 2d87d7d1ac [Bugfix] Modify modelscope api usage in transformer_utils (#13807) Shanshan Shen 2025-02-25 16:36:07 +08:00