Commit Graph

61 Commits

Author SHA1 Message Date
Lucia Fang f525c0be8b [Model][Speculative Decoding] DeepSeek MTP spec decode (#12755)
Signed-off-by: Lu Fang <fanglu@fb.com>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2025-02-19 17:06:23 +08:00
shangmingc 46cdd59577 [Feature][Spec Decode] Simplify the use of Eagle Spec Decode (#12304)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-02-16 19:32:26 -08:00
Harry Mellor f2b20fe491 Consolidate Llama model usage in tests (#13094) 2025-02-13 22:18:03 -08:00
Thomas Parnell bb392af434 [Doc] Replace ibm-fms with ibm-ai-platform (#12709)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-02-04 07:05:04 +00:00
Russell Bryant e489ad7a21 [Misc] Add SPDX-License-Identifier headers to python source files (#12628)
- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**

commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:18:24 2025 -0500

    Add SPDX license headers to python source files
    
This commit adds SPDX license headers to python source files as
recommended to
the project by the Linux Foundation. These headers provide a concise way
that is
both human and machine readable for communicating license information
for each
source file. It helps avoid any ambiguity about the license of the code
and can
    also be easily used by tools to help manage license compliance.
    
The Linux Foundation runs license scans against the codebase to help
ensure
    we are in compliance with the licenses of the code we use, including
dependencies. Having these headers in place helps that tool do its job.
    
    More information can be found on the SPDX site:
    
    - https://spdx.dev/learn/handling-license-info/
    
    Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:36:32 2025 -0500

    Check for SPDX headers using pre-commit
    
    Signed-off-by: Russell Bryant <rbryant@redhat.com>

---------

Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-02 11:58:18 -08:00
Nicolò Lucchesi 6116ca8cd7 [Feature] [Spec decode]: Enable MLPSpeculator/Medusa and prompt_logprobs with ChunkedPrefill (#10132)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: wallashss <wallashss@ibm.com>
Co-authored-by: wallashss <wallashss@ibm.com>
2025-01-27 13:38:35 -08:00
youkaichao 899136b857 [ci] fix broken distributed-tests-4-gpus (#11937)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-11 09:07:24 +08:00
jeongin601 1bf905ddaa [Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. (#10198)
Signed-off-by: jeongin601 <0200angela@gmail.com>
Signed-off-by: jeong_in.bae <jeong_in.bae@navercorp.com>
2024-11-27 05:07:30 +00:00
Murali Andoorveedu db66e018ea [Bugfix] Fix for Spec model TP + Chunked Prefill (#10232)
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Co-authored-by: Sourashis Roy <sroy@roblox.com>
2024-11-26 09:11:16 -08:00
sroy745 f6778620a9 Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 (#10136)
Signed-off-by: Sourashis Roy <sroy@roblox.com>
2024-11-08 15:56:18 +00:00
Nicolò Lucchesi 9d43afcc53 [Feature] [Spec decode]: Combine chunked prefill with speculative decoding (#9291)
Signed-off-by: NickLucche <nlucches@redhat.com>
2024-11-07 08:15:14 -08:00
wangshuai09 4e2d95e372 [Hardware][ROCM] using current_platform.is_rocm (#9642)
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-28 04:07:00 +00:00
Cody Yu d11bf435a0 [MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py (#9510) 2024-10-18 14:30:55 -07:00
Kuntai Du 81ede99ca4 [Core] Deprecating block manager v1 and make block manager v2 default (#8704)
Removing the block manager v1. This is the initial piece of prefix-caching-centric design. In order to achieve prefix-caching-centric design, we need to simplify the code path so that we only use v2 block manager (which has much higher performance on prefix caching).
2024-10-17 11:38:15 -05:00
sroy745 f3a507f1d3 [Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 (#9149) 2024-10-10 14:17:17 +08:00
sroy745 91add85ec4 Fix failing spec decode test (#9054) 2024-10-03 23:07:29 +00:00
Lily Liu 1570203864 [Spec Decode] (1/2) Remove batch expansion (#8839) 2024-10-01 16:04:42 -07:00
Travis Johnson 01b6f9e1f0 [Core][Bugfix] Support prompt_logprobs returned with speculative decoding (#8047)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-09-24 17:29:56 -07:00
Lily Liu c6bd70d772 [SpecDec][Misc] Cleanup, remove bonus token logic. (#8701) 2024-09-22 12:34:14 -07:00
Lily Liu 775f00f81e [Speculative Decoding] Test refactor (#8317)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-09-11 14:07:34 -07:00
Travis Johnson cc0eaf12b1 [Bugfix] spec decode handle None entries in topk args in create_sequence_group_output (#7232)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-22 09:33:48 -04:00
Abhinav Goyal a3fce56b88 [Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830) 2024-08-22 02:42:24 -07:00
Abhinav Goyal 312f761232 [Speculative Decoding] Fixing hidden states handling in batch expansion (#7508) 2024-08-19 17:58:14 -07:00
shangmingc b67ae00cdb [Misc] Add quantization config support for speculative model. (#7343) 2024-08-15 19:34:28 -07:00
Wallas Henrique 70b746efcf [Misc] Deprecation Warning when setting --engine-use-ray (#7424)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-14 09:44:27 -07:00
Travis Johnson 99b4cf5f23 [Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary (#7218)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-08 22:08:46 -07:00
Nick Hill 5cf9254a9c [BugFix] Fix use of per-request seed with pipeline parallel (#6698) 2024-07-30 10:40:08 -07:00
Nick Hill 2cf0df3381 [Bugfix] Fix speculative decode seeded test (#6743) 2024-07-24 08:58:31 -07:00
Nick Hill c882a7f5b3 [SpecDecoding] Update MLPSpeculator CI tests to use smaller model (#6714) 2024-07-24 07:34:22 +00:00
sroy745 14f91fe67c [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485) 2024-07-20 23:58:58 -07:00
Thomas Parnell a5314e8698 [Model] RowParallelLinear: pass bias to quant_method.apply (#6327)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-19 07:15:22 -06:00
Woo-Yeon Lee a921e86392 [BUGFIX] Raise an error for no draft token case when draft_tp>1 (#6369) 2024-07-19 06:01:09 -07:00
Thomas Parnell d4201e06d5 [Bugfix] Make spec. decode respect per-request seed. (#6034)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-07-18 19:22:08 -07:00
Alexander Matveev e76466dde2 [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338) 2024-07-17 14:30:28 -07:00
Cody Yu 160e1d8c99 [Misc] Log spec decode metrics (#6454) 2024-07-16 20:37:10 +00:00
Abhinav Goyal 2416b26e11 [Speculative Decoding] Medusa Implementation with Top-1 proposer (#4978) 2024-07-09 18:34:02 -07:00
Swapnil Parekh 4d6ada947c [CORE] Adding support for insertion of soft-tuned prompts (#4645)
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-09 13:26:36 -07:00
Qubitium-ModelCloud ee93f4f92a [CORE] Quantized lm-head Framework (#4442)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
2024-07-02 22:25:17 +00:00
Sirej Dua 15aba081f3 [Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) (#6050)
Co-authored-by: Sirej Dua <sirej.dua@databricks.com>
Co-authored-by: Sirej Dua <Sirej Dua>
2024-07-02 07:20:29 -07:00
xwjiang2010 98d6682cd1 [VLM] Remove image_input_type from VLM config (#5852)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-02 07:57:09 +00:00
sroy745 80ca1e6a3a [Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (#5348) 2024-07-01 00:33:05 -07:00
Thomas Parnell c2a8ac75e0 [CI/Build] Add E2E tests for MLPSpeculator (#5791)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-26 00:04:08 +00:00
Woo-Yeon Lee 2ce5d6688b [Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414) 2024-06-25 09:56:06 +00:00
zifeitong 78687504f7 [Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654) 2024-06-19 13:57:12 -07:00
Cyrus Leung 0e9164b40a [mypy] Enable type checking for test directory (#5017) 2024-06-15 04:45:31 +00:00
Nick Hill 99dac099ab [Core][Doc] Default to multiprocessing for single-node distributed case (#5230)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-11 11:10:41 -07:00
Cyrus Leung 7a64d24aad [Core] Support image processor (#4197) 2024-06-02 22:56:41 -07:00
Lily Liu d5a1697772 [Dynamic Spec Decoding] Minor fix for disabling speculative decoding (#5000) 2024-05-25 10:00:14 -07:00
Alexei-V-Ivanov-AMD 943e72ca56 [Build/CI] Enabling AMD Entrypoints Test (#4834)
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com>
2024-05-20 11:29:28 -07:00
Cody Yu 973617ae02 [Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840)
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
2024-05-16 00:53:51 -07:00