Commit Graph

4763 Commits

Author SHA1 Message Date
Huy Do e7ef74e26e Fix some issues with benchmark data output (#13641)
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-02-24 10:23:18 +08:00
Nick Hill cbae7af552 [V1][BugFix] Fix engine core client shutdown hangs (#13298)
Even though ZMQ context.destroy() is meant to close open sockets before terminating the context, it appears to be necessary to do this explicitly or else it can hang in the context.term() method.

Close zmq sockets explicitly before terminating context, make shutdown of client resource more robust, shut down engine core process prior to terminating zmq context.

Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-23 13:07:43 -08:00
youkaichao eb24dc4a45 [v1] torchrun compatibility (#13642)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-23 22:47:24 +08:00
Roger Wang 9bebc9512f [Misc] Deprecate --dataset from benchmark_serving.py (#13708)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-02-23 13:32:20 +00:00
Nick Hill 5a2ba16f5c [Core][Distributed] Use IPC (domain socket) ZMQ socket for local comms (#13688) 2025-02-23 02:54:29 -08:00
Isotr0py ba5106e519 [LMM] Implement merged multimodal processor for whisper (#13278) 2025-02-23 01:46:03 -08:00
Kyle Sayers d5ca2110f1 [Quant] BaiChuan SupportsQuant (#13710) 2025-02-22 19:21:15 -08:00
Kevin H. Luu 2c5e637b57 [ci] Use env var to control whether to use S3 bucket in CI (#13634) 2025-02-22 19:19:45 -08:00
Andy Lo 322d2a27d6 [BugFix] Minor: logger import in attention backend (#13706)
Signed-off-by: Andy Lo <andy@mistral.ai>
2025-02-22 16:51:13 -08:00
Roger Wang 82e0d601fc [CI/Build] Fix pre-commit errors from #13571 (#13709)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-02-22 16:50:38 -08:00
Daniele 78ac0f591d [CI/Build] fix uv caching in Dockerfile (#13611) 2025-02-22 08:25:20 -08:00
Yan Ma b56155e7f3 [XPU]fix setuptools version for xpu (#13548) 2025-02-22 08:05:35 -08:00
Helena Kloosterman 382f66fb08 [Bugfix] Fix boolean conversion for OpenVINO env variable (#13615) 2025-02-22 08:04:12 -08:00
Cyrus Leung 8354f6640c [Doc] Dockerfile instructions for optional dependencies and dev transformers (#13699) 2025-02-22 06:04:31 -08:00
Gregory Shtrasberg c904fdddf6 [ROCm] Apply FP8 weights padding to values not divisible by 512 bytes on ROCm (#13231) 2025-02-22 05:54:38 -08:00
Sage Moore 558db8083c [V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths (#13095) 2025-02-22 05:25:41 -08:00
Kaixi Hou e109e598c7 [NVIDIA] Support nvfp4 cutlass gemm (#13571) 2025-02-22 05:24:05 -08:00
Keyun Tong 8db1b9d0a1 Support SSL Key Rotation in HTTP Server (#13495) 2025-02-22 05:17:44 -08:00
youkaichao 2382ad29d1 [ci] fix linter (#13701)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-22 20:28:59 +08:00
youkaichao 3e472d882a [core] set up data parallel communication (#13591)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-22 19:28:59 +08:00
Cyrus Leung 7f6bae561c [CI/Build] Fix pre-commit errors (#13696) 2025-02-22 00:31:26 -08:00
Jee Jee Li 105b8ce4c0 [Misc] Reduce LoRA-related static variable (#13166) 2025-02-22 00:21:30 -08:00
Mark McLoughlin 2cb8c1540e [Metrics] Add --show-hidden-metrics-for-version CLI arg (#13295) 2025-02-22 00:20:45 -08:00
Mark McLoughlin 1cd981da4f [V1][Metrics] Support vllm:cache_config_info (#13299) 2025-02-22 00:20:00 -08:00
Yu Chin Fabian Lim fca20841c2 Correction to TP logic for Mamba Mixer 2 when Num Groups not divisible by TP Size (#13660) 2025-02-22 00:19:10 -08:00
Jennifer Zhao da31b5333e [Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler (#13594)
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-02-22 00:08:29 -08:00
Lu Fang bb78fb318e [v1] Support allowed_token_ids in v1 Sampler (#13210)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-22 14:13:05 +08:00
Robin 8aca27fa11 [Bugfix] Fix benchmark script bug: inaccurate stats for vllm backend when max_model_len < input_len + output_len (#13691)
Signed-off-by: WangErXiao <863579016@qq.com>
2025-02-22 14:10:38 +08:00
Dipika Sikka 95c617e04b [Misc] Bump compressed-tensors (#13619) 2025-02-21 22:09:04 -08:00
Shane A 9a1f1da5d1 [Bugfix][Model] OLMo 2: split qkv correctly for GQA and MQA (#13687) 2025-02-21 22:07:45 -08:00
Gordon Wong 68d630a0c7 [ROCM] fix native attention function call (#13650) 2025-02-21 22:07:04 -08:00
Jun Duan 68d535ef44 [Misc] Capture and log the time of loading weights (#13666) 2025-02-21 22:06:34 -08:00
Robin c6ed93860f [Bugfix][API Server] Fix invalid usage of 'ge' and 'le' in port valid… (#13672) 2025-02-21 22:05:28 -08:00
Keyun Tong 0ffdf8ce0c [HTTP Server] Make model param optional in request (#13568) 2025-02-21 21:55:50 -08:00
Yuan Tang 8c0dd3d4df docs: Add a note on full CI run in contributing guide (#13646) 2025-02-21 21:53:59 -08:00
Isotr0py ada7c780d5 [Misc] Fix yapf linting tools etc not running on pre-commit (#13695)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-02-22 13:10:43 +08:00
Lucas Wilkinson 288cc6c234 [Attention] MLA with chunked prefill (#12639)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Patrick Horn <patrick.horn@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-21 15:30:12 -08:00
John Zheng 900edbfa48 fix typo of grafana dashboard, with correct datasource (#13668)
Signed-off-by: John Zheng <john.zheng@hp.com>
2025-02-21 18:21:05 +00:00
Isotr0py b2c3fc5d65 [Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation (#13586) 2025-02-20 22:24:17 -08:00
leoneo 839b27c6cc [Kernel]Add streamK for block-quantized CUTLASS kernels (#12978) 2025-02-20 22:14:24 -08:00
Kevin H. Luu 34ad27fe83 [ci] Fix metrics test model path (#13635) 2025-02-20 22:12:10 -08:00
Gabriel Marinho 1c3c975766 [FEATURE] Enables /score endpoint for embedding models (#12846) 2025-02-20 22:09:47 -08:00
Szymon Ożóg 1cdc88614a Missing comment explaining VDR variable in GGUF kernels (#13290) 2025-02-20 22:06:54 -08:00
Nick Hill 31aa045c11 [V1][Sampler] Avoid an operation during temperature application (#13587) 2025-02-20 22:05:56 -08:00
Roger Wang a30c093502 [Bugfix] Add mm_processor_kwargs to chat-related protocols (#13644) 2025-02-20 22:04:33 -08:00
Harry Mellor c7b07a95a6 Use pre-commit to update requirements-test.txt (#13617) 2025-02-20 22:03:27 -08:00
Kaixi Hou 27a09dc52c [NVIDIA] Fix an issue to use current stream for the nvfp4 quant (#13632) 2025-02-20 22:01:48 -08:00
Edwin Hernandez 981f3c831e [Misc] Adding script to setup ray for multi-node vllm deployments (#12913) 2025-02-20 21:16:40 -08:00
Kante Yin 44c33f01f3 Add llmaz as another integration (#13643)
Signed-off-by: kerthcet <kerthcet@gmail.com>
2025-02-21 03:52:40 +00:00
Lingfan Yu 33170081f1 [Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth (#13245)
Signed-off-by: Lingfan Yu <lingfany@amazon.com>
2025-02-20 17:45:45 -08:00