Commit Graph

  • 407b5537db [Build] Make pypi install work on CPU platform (#12874) wangxiyuan 2025-02-08 17:15:15 +08:00
  • 4ea48fb35c [V1][Minor] Move cascade attn logic outside _prepare_inputs (#12943) Woosuk Kwon 2025-02-08 00:39:09 -08:00
  • e31498bdcb [Misc] Add offline test for disaggregated prefill (#12418) Shaoting 2025-02-08 02:38:20 -06:00
  • 91dd8f7aa6 [bugfix] respect distributed_executor_backend in world_size=1 (#12934) youkaichao 2025-02-08 16:17:08 +08:00
  • d01f66b039 [Bugfix] Fix multi-round chat error when mistral tokenizer is used (#12859) zifeitong 2025-02-07 23:04:34 -08:00
  • cc01223f3b [Misc] Fix typo in the example file (#12896) Ke Zhao 2025-02-08 14:56:43 +08:00
  • 306923da82 [Bugfix] Fix Qwen2_5_VLForConditionalGeneration packed_modules_mapping (#12905) Jee Jee Li 2025-02-08 13:02:53 +08:00
  • 3243158336 [V1] Move KV block hashes from Request to KVCacheManager (#12922) Woosuk Kwon 2025-02-07 19:14:10 -08:00
  • b21f0f9d17 [V1][Minor] Remove outdated comment (#12928) Woosuk Kwon 2025-02-07 19:07:37 -08:00
  • 45cbc4991d [Bugfix] Fix disagg hang caused by the prefill and decode communication issues (#12723) Lu Fang 2025-02-07 16:39:50 -08:00
  • 932c6b7461 [V1] LM Eval With Streaming Integration Tests (#11590) Robert Shaw 2025-02-07 18:07:03 -05:00
  • eaa92d4437 [ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing (#12501) TJian 2025-02-08 00:13:43 +08:00
  • 0630d4537a [V1] Logprobs and prompt logprobs support (#9880) afeldman-nm 2025-02-07 10:26:20 -05:00
  • 538fab93cd PR #12718 (#12718) Amit Garg 2025-02-07 06:22:37 -08:00
  • ce26b16268 [Misc] Remove unnecessary detokenization in multimodal processing (#12868) Cyrus Leung 2025-02-07 22:21:17 +08:00
  • 1918aa1b80 [MISC][EASY] Break check file names into entry and args in the pre-commit hooks (#12880) Lu Fang 2025-02-07 05:04:39 -08:00
  • 6e1fc61f0f Prevent unecessary requests to huggingface hub (#12837) Maximilien de Bayser 2025-02-07 02:37:41 -03:00
  • aa375dca9f [Bugfix] Missing quant_config in deepseek embedding layer (#12836) Szymon Ożóg 2025-02-07 06:35:09 +01:00
  • 433c4a4923 Make vllm compatible with verl (#12824) ZSL98 2025-02-07 11:54:20 +08:00
  • ef533d25fb [Bugfix] FA2 illegal memory access (#12848) Lucas Wilkinson 2025-02-06 22:54:07 -05:00
  • b260782357 [misc] Revert # 12833 (#12857) Kevin H. Luu 2025-02-06 16:29:12 -08:00
  • 741429a4cd [MISC] Check space in the file names in the pre commit checks (#12804) Lu Fang 2025-02-06 15:36:21 -08:00
  • aff404571b Add Bamba Model (#10909) Yu Chin Fabian Lim 2025-02-07 07:22:42 +08:00
  • 467a96a541 [V1] LoRA Support (#10957) Varun Sundar Rabindranath 2025-02-06 23:02:51 +05:30
  • 8108ac841d [Bugfix] Fix unsupported FA version check for Turing GPU (#12828) Isotr0py 2025-02-07 01:18:22 +08:00
  • afe74f7a96 [Doc] double quote cmake package in build.inc.md (#12840) Jitse Klomp 2025-02-06 18:17:55 +01:00
  • 09b95e36ab [torch.compile] PyTorch 2.6 and nightly compatibility (#12393) youkaichao 2025-02-07 01:09:07 +08:00
  • 85ac82d228 [Kernel] Make rotary_embedding ops more flexible with input shape (#12777) Isotr0py 2025-02-07 00:46:13 +08:00
  • 1e57b1ee63 [Misc] Remove unnecessary decode call (#12833) Cyrus Leung 2025-02-07 00:45:44 +08:00
  • e152f29502 [misc] Reduce number of config file requests to HuggingFace (#12797) Kevin H. Luu 2025-02-06 06:59:18 -08:00
  • c786e757fa [Attention] Use FA3 for MLA on Hopper (#12807) Lucas Wilkinson 2025-02-06 06:43:12 -05:00
  • cefd56ee35 [Docs] Add Google Cloud Slides (#12814) Simon Mo 2025-02-06 01:02:38 -08:00
  • 7ca9934fe7 [Misc] Update w2 scale loading for GPTQMarlinMoE (#12757) Dipika Sikka 2025-02-06 04:02:14 -05:00
  • 0408efc6d0 [Misc] Improve error message for incorrect pynvml (#12809) youkaichao 2025-02-06 15:23:50 +08:00
  • 449d1bce02 [Misc] Remove duplicated DeepSeek V2/V3 model definition (#12793) Michael Goin 2025-02-06 02:16:20 -05:00
  • 1a6fcad4c9 Improve TransformersModel UX (#12785) Harry Mellor 2025-02-06 06:24:57 +00:00
  • 56534cd577 [Bugfix] Fix the test_ultravox.py's license (#12806) Lu Fang 2025-02-05 21:25:54 -08:00
  • d88506dda4 [Model] LoRA Support for Ultravox model (#11253) Sumit Vij 2025-02-05 19:54:13 -08:00
  • 9cdea30b4f [Misc][Easy] Remove the space from the file name Lu Fang 2025-02-05 19:23:35 -08:00
  • 76abd0c881 [Bugfix] Better FP8 supported defaults Lucas Wilkinson 2025-02-05 22:22:19 -05:00
  • 5b19b93082 [ROCm][Kernel] Using the correct warp_size value Gregory Shtrasberg 2025-02-05 22:15:08 -05:00
  • 75404d041b [VLM] Update compatibility with transformers 4.49 Cyrus Leung 2025-02-06 11:09:45 +08:00
  • bf3b79efb8 [VLM] Qwen2.5-VL Roger Wang 2025-02-05 13:31:38 -08:00
  • 9a5b1554b4 [Docs] Drop duplicate [source] links Russell Bryant 2025-02-05 16:30:50 -05:00
  • a4ce74c14a [VLM] Use shared field to pass token ids to model Cyrus Leung 2025-02-06 05:30:46 +08:00
  • 3b2005e1db Add: Support for Sparse24Bitmask Compressed Models Rahul Tuli 2025-02-05 15:30:43 -06:00
  • af8486de49 [Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU) Sanju C Sudhakaran 2025-02-06 02:59:45 +05:30
  • 4c3aac51e1 Merging PR #12536 Chen Zhang 2025-02-06 05:24:26 +08:00
  • bc1bdecebf [core][distributed] exact ray placement control (#12732) youkaichao 2025-02-06 02:03:19 +08:00
  • 022bcc701a [Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 (#12546) Akash kaothalkar 2025-02-05 12:41:02 +05:30
  • c53dc466b1 [Doc] Remove performance warning for auto_awq.md (#12743) Michael Goin 2025-02-05 01:43:11 -05:00
  • 3d09e592a8 [V1][Misc] Shorten FinishReason enum and use constant strings (#12760) Nick Hill 2025-02-04 22:43:02 -08:00
  • fcf2e3d7fc [Bugfix] Fix OpenVINO model runner (#12750) Harry Mellor 2025-02-05 06:42:46 +00:00
  • 58b218d7ae [Doc] Update PR Reminder with link to Developer Slack (#12748) Michael Goin 2025-02-05 01:42:09 -05:00
  • 7ff7a638b6 [Model][Quant] Fix GLM, Fix fused module mappings for quantization (#12634) Kyle Sayers 2025-02-05 00:32:06 -05:00
  • 686006a220 [Misc] Bump the compressed-tensors version (#12736) Dipika Sikka 2025-02-04 23:44:48 -05:00
  • 98fd089fc9 [VLM] Add MLA with pure RoPE support for deepseek-vl2 models (#12729) Isotr0py 2025-02-05 12:44:26 +08:00
  • 249824c3bf Refactor Linear handling in TransformersModel (#12727) Harry Mellor 2025-02-05 04:31:12 +00:00
  • 64862d106e [ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling (#12713) Aleksandr Malyshev 2025-02-04 19:58:22 -08:00
  • b3a0d01e45 [Core] add and implement VLLM_LOGITS_PROCESSOR_THREADS (#12368) Aviv Keshet 2025-02-04 18:46:26 -08:00
  • 75e94309e8 [Perf] Mem align KV caches for CUDA devices (MLA perf improvement) (#12676) Lucas Wilkinson 2025-02-04 21:22:24 -05:00
  • 233df6f5c4 [V1][Metrics] Add request_success_total counter, labelled with finish reason (#12579) Mark McLoughlin 2025-02-05 00:46:54 +00:00
  • 18016a5e62 [Bugfix] Fix CI failures for InternVL and Mantis models (#12728) Cyrus Leung 2025-02-04 23:54:23 +08:00
  • 649550f27e [Build] update requirements of no-device for plugin usage (#12630) Sophie du Couédic 2025-02-04 14:19:12 +01:00
  • 62467a834a Avoid unnecessary multi-modal input data copy when len(batch) == 1 (#12722) Kero Liang 2025-02-04 21:03:19 +08:00
  • 6469038b14 [Bugfix] Fix loading of fine-tuned models based on Phi-3-Small (#12689) Michael Greenbaum 2025-02-04 14:58:48 +02:00
  • 815079de8e [VLM] merged multimodal processor and V1 support for idefics3 (#12660) Isotr0py 2025-02-04 20:00:51 +08:00
  • 18a88fcccc [V1] Remove scheduling constraint on partial requests (#12674) Woosuk Kwon 2025-02-04 02:43:58 -08:00
  • d1ca7df84d [VLM] Merged multi-modal processor for InternVL-based models (#12553) Cyrus Leung 2025-02-04 16:44:52 +08:00
  • 96b23621c1 [Misc] Add BNB quantization for Whisper (#12381) Jee Jee Li 2025-02-04 16:27:36 +08:00
  • c36ac98d01 [AMD][ROCm] Enable DeepSeek model on ROCm (#12662) Hongxia Yang 2025-02-04 03:24:11 -05:00
  • 4896d0c2dd [Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs (#12711) Kyle Sayers 2025-02-04 02:27:11 -05:00
  • bb392af434 [Doc] Replace ibm-fms with ibm-ai-platform (#12709) Thomas Parnell 2025-02-04 02:05:04 -05:00
  • 5d98d56089 Support Pixtral-Large HF by using llava multimodal_projector_bias config (#12710) Michael Goin 2025-02-03 22:55:46 -05:00
  • 73b35cca7f [Core] Improve hash collision avoidance in prefix caching (#12621) Russell Bryant 2025-02-03 19:28:20 -05:00
  • 5095e96606 [V1] Revert uncache_blocks and support recaching full blocks (#12415) Cody Yu 2025-02-03 15:04:53 -08:00
  • cf58b9c4ca [MISC] Remove model input dumping when exception (#12582) Cody Yu 2025-02-03 13:34:16 -08:00
  • 4797dad3ec [Model] Add Deepseek V3 fp8_w8a8 configs for B200 (#12707) kushanam 2025-02-03 13:30:39 -08:00
  • 6dd5e52823 Squelch MLA warning for Compressed-Tensors Models (#12704) Kyle Sayers 2025-02-03 16:29:56 -05:00
  • c11de33dad [Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm (#12696) Tyler Michael Smith 2025-02-03 16:04:59 -05:00
  • 33e0602e59 [Misc] Fix improper placement of SPDX header in scripts (#12694) Russell Bryant 2025-02-03 14:16:59 -05:00
  • a1a2aaadb9 [Model]: Add transformers backend support (#11330) Arthur 2025-02-03 14:30:38 +01:00
  • 1298a400e8 [ci/build] fix gh200 test (#12681) youkaichao 2025-02-03 15:59:49 +08:00
  • ad4a9dc817 [cuda] manually import the correct pynvml module (#12679) youkaichao 2025-02-03 15:58:21 +08:00
  • b9986454fe Fix for attention layers to remain unquantized during moe_wn16 quant (#12570) Srikanth Srinivas 2025-02-02 21:46:19 -08:00
  • c5932e5dac Properly check if all fused layers are in the list of targets (#12666) Eldar Kurtic 2025-02-03 06:42:18 +01:00
  • 20579c0fae make sure mistral_common not imported for non-mistral models (#12669) youkaichao 2025-02-03 13:40:25 +08:00
  • 95460fc513 [Kernel] port sgl moe_align_block_size kernels (#12574) Yang Chen 2025-02-02 21:09:50 -08:00
  • 326fcc8b9f [Doc] Deprecate Discord (#12668) Zhuohan Li 2025-02-02 19:19:56 -08:00
  • e64330910b [doc][misc] clarify VLLM_HOST_IP for multi-node inference (#12667) youkaichao 2025-02-03 09:32:18 +08:00
  • e489ad7a21 [Misc] Add SPDX-License-Identifier headers to python source files (#12628) Russell Bryant 2025-02-02 14:58:18 -05:00
  • f256ebe4df [Hardware][Intel GPU] add XPU bf16 support (#12392) Kunshang Ji 2025-02-02 18:17:26 +08:00
  • f8ece6e17f [Core][v1] Unify allocating slots in prefill and decode in KV cache manager (#12608) Shawn Du 2025-02-02 16:40:58 +08:00
  • abfcdcdf27 [V1][Minor] Avoid frequently creating ConstantList (#12653) Woosuk Kwon 2025-02-01 23:43:20 -08:00
  • e497f33491 [Core] Silence unnecessary deprecation warnings (#12620) Russell Bryant 2025-02-02 02:35:50 -05:00
  • baaa2b24da [Bugfix] fix moe_wna16 get_quant_method (#12648) Jinzhen Lin 2025-02-02 15:29:56 +08:00
  • b4e5c03306 doc: fixing minor typo in readme.md (#12643) Vicente Herrera 2025-02-01 18:17:29 +01:00
  • 3194039c0e Apply torch.compile to fused_moe/grouped_topk (#12637) Michael Goin 2025-02-01 11:16:19 -05:00
  • 4f4d427ac2 Disable chunked prefill and/or prefix caching when MLA is enabled (#12642) Simon Mo 2025-01-31 23:46:57 -08:00
  • 1e3698393f [CI/Build] Add label automation for structured-output, speculative-decoding, v1 (#12280) Russell Bryant 2025-02-01 02:13:10 -05:00