Commit Graph

  • 4046466534 env var to disable pinned memory (main) wassname 2025-03-07 08:18:58 +08:00
  • 0578e5a462 [Hardware][TPU]Enable ragged paged attention kernel and resolve recompilation issue (#14310) Chengji Yao 2025-03-06 15:31:05 -08:00
  • 04222984f8 [Docs] Add nsight guide to profiling docs (#14298) Michael Goin 2025-03-06 17:19:58 -05:00
  • 6832707e90 [V1][Bugfix] Standardize quantized kv cache rejection for attention backends (#14221) Michael Goin 2025-03-06 17:18:29 -05:00
  • 6b2ef5cd17 [Bug] Fix Attention when ignored in by quant_method (#14313) Michael Goin 2025-03-06 17:18:06 -05:00
  • 958adce478 [Bugfix] Fix use_direct_call condition in FusedMoE layer for (#14382) Tyler Michael Smith 2025-03-06 17:17:21 -05:00
  • 99b0915d3b [Kernel] Add needs_fixed_stride_order tag to most GEMMs (#14306) Tyler Michael Smith 2025-03-06 17:17:09 -05:00
  • 8ca2b21c98 [CI] Disable spawn when running V1 Test (#14345) Thomas Parnell 2025-03-06 22:52:46 +01:00
  • d9292786e1 [CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa (#13569) Michael Goin 2025-03-06 16:08:36 -05:00
  • cc2f9b32c8 [Distributed] Add enable_expert_parallel arg (#14305) Tyler Michael Smith 2025-03-06 13:54:45 -05:00
  • cd579352bf [V1] Do not detokenize if sampling param detokenize is False (#14224) Himanshu Jaju 2025-03-06 19:40:24 +01:00
  • 9f1710f1ac Fix mla prefill context performance (#13897) Ying Zhong 2025-03-07 01:35:49 +08:00
  • e642ec962c Add authors to license header. (#14371) Thomas Parnell 2025-03-06 17:43:09 +01:00
  • ada19210a3 Adding cpu inference with VXE ISA for s390x architecture (#12613) Dilip Gowda Bhagavan 2025-03-06 22:10:53 +05:30
  • bf0560bda9 Reinstate best_of for V0 (#14356) Harry Mellor 2025-03-06 17:34:22 +01:00
  • 151b08e0fe [RLHF] use worker_extension_cls for compatibility with V0 and V1 (#14185) youkaichao 2025-03-07 00:32:46 +08:00
  • 81b2f4a45f [Doc] Fix date typo in README.md (#14366) Jitse Klomp 2025-03-06 17:29:57 +01:00
  • 82551ad616 [Core] Don't use cache during multi-modal profiling (#14336) Cyrus Leung 2025-03-07 00:03:31 +08:00
  • caac5c2e59 [Bugfix][Core] fix abort_seq_group and memory leak when n>1 (#14326) courage17340 2025-03-06 23:59:32 +08:00
  • 6bd1dd9d26 [Kernel] [V1] Improved performance for V1 Triton (ROCm) backend (#14152) Thomas Parnell 2025-03-06 16:39:16 +01:00
  • 4f27044aab [Doc] Correct beam_search using in generative_models.md (#14363) Irina Yuryeva 2025-03-06 18:37:10 +03:00
  • 0ddc991f5c [Doc] Update reasoning with stream example to use OpenAI library (#14077) Yanyi Liu 2025-03-06 21:20:37 +08:00
  • fa82b93853 [Frontend][Docs] Transcription API streaming (#13301) Nicolò Lucchesi 2025-03-06 11:39:35 +01:00
  • 69ff99fdcd [Core] Optimizing cross-attention QKVParallelLinear computation (#12325) Nicolò Lucchesi 2025-03-06 10:37:26 +01:00
  • 5d802522a7 [V1][VLM][Pixtral-HF] Support Pixtral-HF on V1 (#14275) lkchen 2025-03-06 00:58:41 -08:00
  • 1769928079 [Model] Update Paligemma multimodal processing with PromptUpdate (#14015) kYLe 2025-03-06 02:31:38 -06:00
  • ed6ea06577 [Hardware] Update the flash attn tag to support Blackwell (#14244) Pavani Majety 2025-03-05 22:01:37 -08:00
  • 5ee10e990d [Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention (#11301) Nicolò Lucchesi 2025-03-06 05:00:53 +01:00
  • 3dbd2d813a [V1] LoRA - Enable more V1 tests (#14315) Varun Sundar Rabindranath 2025-03-05 22:55:42 -05:00
  • f5f7f00cd9 [Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1 (#14114) Ce Gao 2025-03-06 11:49:20 +08:00
  • abcc61e0af [misc] Mention ray list nodes command to troubleshoot ray issues (#14318) Rui Qiao 2025-03-05 18:00:36 -08:00
  • f6bb18fd9a [BugFix] MLA + V1, illegal memory access and accuracy issues (#14253) Lucas Wilkinson 2025-03-05 20:10:13 -05:00
  • 71eaf8969b [Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation (#13850) Yuan Tang 2025-03-05 20:09:29 -05:00
  • ca100c90fe Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (#13917) Michael Goin 2025-03-05 20:08:51 -05:00
  • ffad94397d [CI/Build] Use spawn multiprocessing mode for V1 test pipeline (#14243) Russell Bryant 2025-03-05 20:08:02 -05:00
  • 4dacaa4a83 [BugFix] Fix prefix caching V0 MLA (#14255) Lucas Wilkinson 2025-03-05 20:07:42 -05:00
  • a7ea35aa67 [Bugfix] Remove num_tokens_across_dp (#14302) Tyler Michael Smith 2025-03-05 18:55:55 -05:00
  • 1e3e76b6cc [Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch (#14237) pyc96 2025-03-05 14:22:40 -08:00
  • 53ea6ad830 [V1][Easy] Add empty allowed_token_ids in the v1 sampler test (#14308) Lu Fang 2025-03-05 13:41:18 -08:00
  • 1b7624bf5c [misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env (#14267) Serena 2025-03-06 05:28:50 +08:00
  • ac60dc7fe1 [V1][BugFix] Fix for mixed top_k batch (#14301) Nick Hill 2025-03-05 12:43:04 -08:00
  • a4f1ee35d6 Deprecate best_of Sampling Parameter in anticipation for vLLM V1 (#13997) Vincent 2025-03-05 15:22:43 -05:00
  • a32c8669ca [V1][Minor] Remove obsolete FIXME comment (#14304) Nick Hill 2025-03-05 11:59:23 -08:00
  • ca2ca8de57 [Docs] Add Meta Slides (#14297) Simon Mo 2025-03-05 08:30:23 -08:00
  • f71b00a19e [Bugfix] Fix broken vision language example (#14292) Isotr0py 2025-03-05 23:57:10 +08:00
  • 8f808cf86e prefix_caching.md: Fixed typo (#14293) DaividFrank 2025-03-05 16:43:13 +01:00
  • 7bab4bb048 [Misc] Add Qwen2MoeForCausalLM moe tuning support (#14276) Jee Jee Li 2025-03-05 23:11:29 +08:00
  • e17e4488bd [LoRA] Remove linear hack outside transformers backend (#14177) Isotr0py 2025-03-05 23:06:28 +08:00
  • 257e200a25 [V1][Frontend] Add Testing For V1 Runtime Parameters (#14159) Robert Shaw 2025-03-05 14:18:55 +00:00
  • 47d4a7e004 Small update for external_launcher backend docs (#14288) Zhe Zhang 2025-03-05 05:30:00 -08:00
  • 7f89a594dd [Doc] [3/N] Refer code examples for common cases in dev multimodal processor (#14278) Cyrus Leung 2025-03-05 20:29:50 +08:00
  • 961644e6a8 [Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID (#14217) Iacopo Poli 2025-03-05 12:44:10 +01:00
  • 8d6cd32b7b [Bugfix][V1] Fix allowed_token_ids for v1 Sampler (#14169) Lu Fang 2025-03-05 00:49:44 -08:00
  • ec79b67c77 [Misc][V1] Avoid using envs.VLLM_USE_V1 in mm processing (#14256) Roger Wang 2025-03-04 23:37:16 -08:00
  • 32985bed7c [Frontend] Allow return_tokens_as_token_ids to be passed as a request param (#14066) Benjamin Chislett 2025-03-05 01:30:40 -05:00
  • dae9ec464c Temporarily disable test_awq_gemm_opcheck (#14251) Michael Goin 2025-03-05 01:10:35 -05:00
  • 6eaf93020d [platforms] improve rocm debugging info (#14257) youkaichao 2025-03-05 13:32:18 +08:00
  • 72c62eae5f [V1] EP/TP MoE + DP Attention (#13931) Tyler Michael Smith 2025-03-05 00:27:26 -05:00
  • 0a995d5434 [Model] New model support for Phi-4-multimodal-instruct (#14119) Congcong Chen 2025-03-04 20:57:01 -08:00
  • ade3f7d988 [V1][Bugfix] Do not reset prefix caching metrics (#14235) Cody Yu 2025-03-04 20:39:13 -08:00
  • 0df25101d6 [Bugfix] Fix gptq_marlin for deepseek-v3 (#13750) rainkert 2025-03-05 12:25:53 +08:00
  • e123aafdf0 Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0 (#14157) Michael Goin 2025-03-04 23:25:24 -05:00
  • 5b143d33be Moved numba from common requirements to cuda/rocm specific requirements (#14199) Nishidha 2025-03-05 09:55:00 +05:30
  • eb59b5a6cb [misc] announce china meetup (#14248) youkaichao 2025-03-05 10:33:50 +08:00
  • fbfc3ee37e [V1][TPU] TPU multimodal model support for ragged attention (#14158) Michael Goin 2025-03-04 19:58:48 -05:00
  • 3e1d223626 [ROCm] Disable a few more kernel tests that are broken on ROCm (#14145) Sage Moore 2025-03-04 15:37:55 -08:00
  • 4f5b059f14 Clean up unused padding_idx variables across many model definitions (#13240) Tyler Michael Smith 2025-03-04 16:27:00 -05:00
  • 288ca110f6 [Security] Serialize using safetensors instead of pickle in Mooncake Pipe (#14228) Kuntai Du 2025-03-04 15:10:32 -06:00
  • c2bd2196fc [v1][Metrics] Add design doc (#12745) Mark McLoughlin 2025-03-04 20:36:55 +00:00
  • 550c7ba3dc [Docs] Update Dockerfile dependency image (#14215) Michael Goin 2025-03-04 15:22:11 -05:00
  • e5b2f1601a [Frontend] Do prompt_logprobs clamping for chat as well as completions (#14225) Harry Mellor 2025-03-04 21:13:06 +01:00
  • 9badee53de Fix performance when --generation-config is not None (#14223) Harry Mellor 2025-03-04 20:59:22 +01:00
  • beebf4742a [TPU][Profiler] Support start_profile/stop_profile in TPU worker (#13988) Siyuan Liu 2025-03-04 11:40:06 -08:00
  • f89978ad7c add cutlass support for blackwell fp8 gemm (#13798) kushanam 2025-03-04 07:55:07 -08:00
  • b3cf368d79 [V1][Molmo] Fix get_multimodal_embeddings() in molmo.py (#14161) lkchen 2025-03-04 07:43:59 -08:00
  • c8525f06fc [V0][Metrics] Deprecate some questionable request time metrics (#14135) Mark McLoughlin 2025-03-04 15:11:33 +00:00
  • 5db6b2c961 [V1][BugFix] Fix remaining sync engine client shutdown errors/hangs (#13869) Nick Hill 2025-03-04 07:06:47 -08:00
  • 6247bae6c6 [Bugfix] Restrict MacOS CPU detection (#14210) Michael Goin 2025-03-04 09:25:27 -05:00
  • 3610fb4930 [doc] add "Failed to infer device type" to faq (#14200) youkaichao 2025-03-04 20:47:06 +08:00
  • 71c4b40562 [sleep mode] error out with expandable_segments (#14189) youkaichao 2025-03-04 18:54:19 +08:00
  • ac65bc92df [platform] add debug logging during inferring the device type (#14195) youkaichao 2025-03-04 18:39:16 +08:00
  • f78c0be80a Fix benchmark_moe.py tuning for CUDA devices (#14164) Michael Goin 2025-03-04 00:11:03 -05:00
  • 66233af7b6 Use math.prod instead of np.prod for trivial ops (#14142) Zhanwen Chen 2025-03-04 00:09:22 -05:00
  • bf13d40972 [core] Pass all driver env vars to ray workers unless excluded (#14099) Rui Qiao 2025-03-03 19:44:17 -08:00
  • 989f4f430c [Misc] Remove lru_cache in NvmlCudaPlatform (#14156) Cody Yu 2025-03-03 19:09:34 -08:00
  • bb5b640359 [core] moe fp8 block quant tuning support (#14068) Divakar Verma 2025-03-03 19:30:23 -06:00
  • c060b71408 [Model] Add support for GraniteMoeShared models (#13313) Travis Johnson 2025-03-03 17:04:52 -07:00
  • 79e4937c65 [v1] Add comments to the new ragged paged attention Pallas kernel (#14155) iefgnoix 2025-03-03 15:00:55 -08:00
  • cd1d3c3df8 [Docs] Add GPTQModel (#14056) Qubitium-ModelCloud 2025-03-04 05:59:09 +08:00
  • 19d98e0c7d [Kernel] Optimize moe intermediate_cache usage (#13625) Michael Goin 2025-03-03 16:29:53 -05:00
  • 2b04c209ee [Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3 (#14100) Michael Goin 2025-03-03 16:20:24 -05:00
  • ae122b1cbd [WIP][[V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics (#14055) Mark McLoughlin 2025-03-03 19:04:45 +00:00
  • 872db2be0e [V1] Simplify stats logging (#14082) Nick Hill 2025-03-03 10:34:14 -08:00
  • 2dfdfed8a0 [V0][Metrics] Deprecate some KV/prefix cache metrics (#14136) Mark McLoughlin 2025-03-03 18:25:46 +00:00
  • c41d27156b [V0][Metrics] Remove unimplemented vllm:tokens_total (#14134) Mark McLoughlin 2025-03-03 17:50:22 +00:00
  • 91373a0d15 Fix head_dim not existing in all model configs (Transformers backend) (#14141) Harry Mellor 2025-03-03 17:48:11 +00:00
  • 848a6438ae [ROCm] Faster Custom Paged Attention kernels (#12348) TJian 2025-03-04 01:24:45 +08:00
  • 98175b2816 Improve the docs for TransformersModel (#14147) Harry Mellor 2025-03-03 17:03:05 +00:00
  • 4167252eaf [V1] Refactor parallel sampling support (#13774) Mark McLoughlin 2025-03-03 16:15:27 +00:00
  • f35f8e2242 [Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1 (#13921) Cody Yu 2025-03-03 00:43:14 -08:00